****************************************************************;
* Two-factor ANOVAs with NESTED effects:
*
* Suppose that you have a business that uses widgets as raw
* material for a product that you manufacture and sell. The
* widgets are currently purchased from several different
* suppliers. You notice that your machines are constantly
* breaking down, and find out that this is due to excessive
* variability in the sizes of the widgets. If you can make sure
* that your widgets are more uniform in size, then you can
* adjust your machines to allow for this.
*
* The widget variability could be due to any of three reasons:
*  (1) differences between suppliers, in which case you could
*    restrict to one or two suppliers at a time, or else tell
*    each of your factories to purchase widgets only from the
*    supplier that is closest,
*  (2) differences between average sizes for different batches
*    of widgets from the same supplier. In that case, you
*    can test one or two widgets from each box and return boxes
*    with widgets whose average size is too large or too small, or
*  (3) variability within each batch of widgets, in which case
*    there may be no simple solution other than either buying (or
*    designing) more robust machines or else negotiating with
*    your suppliers to provide more uniform widgets.
*
* In either case (2) or (3), you could promise to pay suppliers
* more if their widgets are more uniform, but it would be
* helpful to know the source of the variability before you
* begin negotiations.
*
* As a way to determine the source of the machine-damaging size
* variation in the widgets, you select 3 suppliers and order
* 4 boxes of widgets from each supplier at different times.
* You select 4 widgets from each box for testing, for a total
* of 3 x 4 x 4 = 48 widgets.
*
* This is a two-factor layout as in examples before, but of
* a slightly different type.
* Here there are two factors with
* factor levels:
*
*   Supplier:
*      AcmeInd   WHSupply   GenWidget
*   Batch:
*      B11 B12 B13 B14   (from AcmeIndustrial)
*      B21 B22 B23 B24   (from WarehouseSupply)
*      B31 B32 B33 B34   (from GeneralWidget)
*
* Thus we have 12 different batches from 3 suppliers, each with
* 4 widgets. Note that batches from different suppliers are
* NOT LOGICALLY COMPARABLE. In particular, there is no logical
* reason to have the same number of batches from each supplier.
*
* If each batch of widgets from each supplier happened to be a
* different color, with (for example) one batch each of red,
* green, blue, and orange widgets from each supplier, then the data
* would be a two-way layout as before with Supplier and Color as
* factors. One could then test for main effects for Color and
* Supplier as well as a Color*Supplier interaction, although that
* is not exactly what we are interested in. In this case, we would
* say that Supplier and Color are CROSSED factors, which basically
* means that we could consider them as factors in a (balanced)
* two-way layout.
*
* More precisely, two factors are CROSSED if you could have
* observations for each cell: That is, for each possible
* combination of levels from the two factors, you could have an
* observation. In this case, Supplier and Batch are NOT crossed,
* since there would be no way to have data about Batch B14 from
* WHSupply or GenWidget.
*
* The opposite of crossed is NESTED. We say here that Batch is a
* NESTED factor that is NESTED within Supplier if each level of
* Batch only has data for one particular level of Supplier.
* That is, we only have data for batch B13 from AcmeIndustrial,
* and only have data for batch B32 from GeneralWidget.
*
* Of course, it is possible for a level (like B23) of one factor
* (like Batch) to have data for a subset of Suppliers and another
* level (like B31) to have data for a partially overlapping
* set of Suppliers.
* In that case, the two factors are neither
* Nested nor Crossed.
*
* A major difference between crossed and nested factors is that
* the interaction term makes no sense for nested factors. In
* this case, there is no way that Batch could have additive or
* nonadditive effects across Suppliers, since we have data for
* each batch for only one supplier. Nevertheless, what the
* interaction term would be if it could exist is the key to
* the statistical analysis of nested effects.
*
* In general, suppose that a factor A (like Batch) is nested within
* a second factor B (like Supplier). Each level B=b of factor B
* has all of the data for a subset of the levels of factor A.
* In this case, for any one Supplier, the model is that of a
* one-way ANOVA where the treatment groups are batches. We can
* then define
*
*   SSA(b) = One-way Model ANOVA SS for levels of A within B=b
*
* and the total NESTED-EFFECT SS (Sum of Squares) as
*
*   SSA(B) = Sum(all levels of B=b) SSA(b)
*
* This measures a tendency for averages of batches to deviate
* from the means of their supplier. This is exactly variation
* type (2) above.
*
* Suppose that we write the data as X_{abc} for 1 <= b <= k,
* 1 <= a <= n_b, and 1 <= c <= n_{ab}, so that we have n_{ab}
* widgets from the a-th batch from supplier b. Then, for each
* fixed b, we have the identity
*
*   Sum(a=1,n_b) Sum(c=1,n_{ab}) (X_{abc}-Xbar)^2
*
*      = Sum(a=1,n_b) Sum(c=1,n_{ab}) (X_{abc}-Xbar_b)^2
*
*        + N_b (Xbar_b-Xbar)^2
*
* where Xbar is the overall mean, Xbar_b is the mean for all
* widgets for all batches from Supplier b and
* N_b = Sum(a=1,n_b) n_{ab} is the total number of widgets from
* that supplier. The double sum on the right-hand side can
* similarly be written
*
*   Sum(a=1,n_b) Sum(c=1,n_{ab}) (X_{abc}-Xbar_{ab})^2
*
*        + Sum(a=1,n_b) n_{ab} (Xbar_{ab} - Xbar_b)^2
*
* The first sum above is the contribution to the SSE for a
* one-way ANOVA on cells for that supplier. The second sum above
* is SSA(b). Thus, summing the two expressions above over all
* suppliers b gives SSE+SSA(B).
* Since SSB = Sum(b) N_b (Xbar_b-Xbar)^2,
* this leads to the identity
*
*   SSCtot = Sum(a,b,c) (X_{abc}-Xbar)^2
*
*          = SSB + SSA(B) + SSE                              (1)
*
* If X_{abc} = mu + e_{abc} where e_{abc} are independent normal
* with mean zero and the same variance, then the three terms on
* the right-hand side of (1) are independent multiples of
* chi-square variables. This
* is the ANOVA decomposition for two factors, one nested within
* the other, and leads to F-tests for the first two sources of
* variability above:
*
*  (i)  A main effect for B based on SSB/SSE
*  (ii) A nested effect for variability of levels of A
*       within the levels of B in which they are nested. This
*       test is based on SSA(B)/SSE.
*
* Since SSE comes from the sum of within-cells variances, this is
* a full factorial model. Thus the model test for (1) is the same
* as that for a one-way ANOVA on cells, which, in the case of two
* factors A and B, with A nested within B, is also the one-way
* ANOVA on the many different levels of A (here, 12 levels for
* 12 total batches).
*
* If a two-factor layout has two crossed factors A and B, then the
* full-factorial ANOVA is analyzed by
*
*   SSCtot = SSA + SSB + SS(AxB) + SSE                       (2)
*
* Here SSA and SSB are the ``main effects'' for A and B and
* SS(AxB) is the interaction effect. If A is nested within B,
* then the corresponding decomposition is, as above,
*
*   SSCtot = SSB + SSA(B) + SSE                              (3)
*
* It follows that, if one could make sense of the interaction AxB,
*
*   SSA(B) = SSA + SS(AxB)                                   (4)
*
* If each level of B has the same number of levels of A (in this
* case, if there are an equal number of batches for each supplier),
* then one could order the batches for each supplier arbitrarily
* and consider two crossed factors, BatchOrder and Supplier.
* Then (2) and (3) together imply (4) in the form
*
*   SSBatch(Supplier) = SS(BatchOrder) + SS(SupplierxBatchOrder)
*
* where A=Batch (the nested factor) and B=Supplier (the larger factor).
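*
* As a quick consistency check on the last identity (a worked count
* added for illustration, using the widget design of 3 suppliers
* with 4 batches each), both sides carry 9 degrees of freedom:
*
*   DF Batch(Supplier)         = 3*(4-1)      = 9
*   DF BatchOrder              = 4-1          = 3
*   DF Supplier x BatchOrder   = (3-1)*(4-1)  = 6,   and 3 + 6 = 9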
*
* Thus a test for Batch variability within supplier can be
* decomposed into two parts, a Main Effect for BatchOrder and
* a Supplier x BatchOrder interaction.
*
* Unfortunately, this association can be done in (4!)^2=576 different
* ways, and no one wants to look at output for 576 different ANOVAs.
* However, SAS allows you to test the Main Effect (SSB) and
* Nested Effect (SSA(B)) directly.
*
* As before, both effects are measured with respect to within-batch
* variability (SSE), which defines error variance in any
* full-factorial model.
*
* (The Batch and Supplier problem was adapted from
*   Montgomery, Design and Analysis of Experiments, 1991, p. 443);
*
****************************************************************;

title 'BATCHES from 3 different vendors';
title2 'BATCHES ARE NESTED WITHIN VENDORS';
options ls=75 ps=60 nocenter;

data widgets;
   retain Supplier;
   length xx $9;   * Room for 9-character names such as GenWidget;
   input xx$ @@;
   * Note the conditional read of 4 additional values ;
   * after reading a batch name;
   * Factors are Supplier and Batch.
   * Values are Qual;
   if substr(xx,1,1)='B' then do;
      input val1-val4 @@;
      Batch=xx;
      Qual=val1; output;
      Qual=val2; output;
      Qual=val3; output;
      Qual=val4; output;
   end;
   else Supplier=xx;
   drop xx val1-val4;
datalines;
AcmeInd
B11 94 92 93 91   B12 91 90 89 88   B13 91 93 94 93   B14 94 97 93 93
WHSupply
B21 94 91 90 92   B22 93 97 95 94   B23 92 93 91 90   B24 93 96 95 94
GenWidget
B31 96 94 92 93   B32 90 92 94 91   B33 93 91 94 93   B34 95 94 93 92
;

****************************************************************;
* Note: The data step could be written in terms of arrays
*   (ignoring semicolons) by replacing the input statement and
*   the four Qual/output lines in the do block above by
*
*       input val1-val4 @@   Batch=xx
*       array vv(*) val1-val4
*       do k=1 to 4  Qual=vv(k)  output  end
*
*   and adding k to the drop list
****************************************************************;

proc print;
   title3 'THE DATA AS SAS SEES IT';
run;

****************************************************************;
* Analysis of batch output as a NESTED EFFECT:
****************************************************************;

proc glm;
   title3 'CORRECT NESTED GLM ANALYSIS';
   title4 'A FULL-FACTORIAL MODEL WITH TWO EFFECTS';
   classes Supplier Batch;
   model Qual = Supplier Batch(Supplier);
run;

****************************************************************;
* The output shows that the main problem is Batch within Supplier,
* and not between-Supplier variability nor (relatively)
* within-Batch variability. The values of the mean sums of
* squares are
*
*   Effect                          DF       Mean SS    P-value
*   -----------------------------------------------------------
*   Error (within Batch)            36     2.2291667     .....
*   Supplier (between Supplier)      2    3.39583333    0.2317
*   Batch(within Supplier)           9   10.38194444    0.0004
*
* Here within-Batch variability does not have a P-value, since
* within-Batch variability is used as the denominator in F-tests.
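*
* As a cross-check on the earlier statement that the full model is
* the same as a one-way ANOVA on cells, the step below (a sketch
* added for illustration, not part of the original analysis) fits
* Batch alone. It assumes the data set widgets built above. From
* the table, the Model line should show 2 + 9 = 11 DF and a Model SS
* of 2*3.39583333 + 9*10.38194444 = 100.2291667, with the same
* Error line as in the nested analysis.
****************************************************************;

proc glm data=widgets;
   title3 'ONE-WAY ANOVA ON THE 12 BATCHES (CHECK ON MODEL SS)';
   classes Batch;
   model Qual = Batch;
run;

****************************************************************;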
*
* We next see what happens if you analyze Batch INCORRECTLY as a
* CROSSED effect:
*
****************************************************************;

proc glm;
   title3 'INCORRECT CROSSED ANOVA ANALYSIS';
   classes Supplier Batch;
   model Qual = Supplier Batch Batch*Supplier;
run;

****************************************************************;
* Note that the last (incorrect) analysis was for a very unbalanced
* model, since `Batch' has 12 different levels distributed over
* 3 Suppliers.
*
* However, we can make Batch into a CROSSED effect by ordering the
* Batches for each Supplier and considering BATCH ORDER as a
* crossed effect. If we impose a Batch Order, then the levels of
* `Batch' will be 1,2,3,4 instead of 1-12. Batch order might
* correspond to the order in which we received them, or the color
* of the boxes, or any other factor which has comparable values
* across Suppliers.
*
* We can now test for the `main effect' of Batch (or Batch_order)
* and even consider a `(Batch)Order x Supplier' interaction. This
* may not make very much sense, but at least we can carry out the
* calculations.
*
* In the following output, note that the value for
*
*       SS(Order) + SS(Supplier*Order)
*
* is exactly the same as SSBatch(Supplier) in the previous output,
* both in values and in the number of degrees of freedom. This
* gives an experimental verification of (4) above.
*
* The first step is to physically assign 4 `Batch-Order' level
* values to the 12 batches. The easiest way to do this is to use the
* 3rd character in the Batch level name (e.g. 4 in B34).
*
****************************************************************;

data widgets;
   set widgets;
   Order='TX';                            * A Template for T1 or T2 or T3 or T4;
   substr(Order,2,1)=substr(Batch,3,1);   * T1 or T2 or T3 or T4;
run;

proc glm;
   title3 'CROSSED-EFFECT ANOVA ANALYSIS OF SUPPLIER AND BATCH-ORDER';
   classes Supplier Order;
   model Qual = Supplier Order Supplier*Order;
run;

****************************************************************;
* The interaction was significant, as we might have expected: There
* is significant between-Batch variability within Suppliers, and,
* if Batches are assigned randomly to Batch order, then we might
* expect a significant Supplier*BatchOrder interaction. To
* illustrate, we next do the interaction plot:
****************************************************************;

proc means nway noprint;
   classes Supplier Order;
   var Qual;
   output out=CellMeans mean=Qual;
run;

proc plot;
   plot Qual*Order=Supplier / vpos=30;
run;

****************************************************************;
* Example II. A two-factor ANOVA for coffee quality
*
* As a second example in which nesting is more obvious, assume
* that coffee quality is tested at 3 different coffee shops in
* each of 12 different cities. The 12 cities are located in
* 3 different states.
*
* Is there a significant variation of coffee quality by state?
* By city? By city within the states that contain them?
*
* This design has two factors with factor levels:
*
*   State:  Ohio  NewYork  Washington
*
*   City:   Columbus  Cleveland  Cincinnati  Oxford  Brooklyn
*           Buffalo  Albany  Oneonta  Seattle  Yakima  Bellingham
*           Spokane
*
* We have one observation from each of 3 coffee shops within each
* city, for a total of 12*3 = 36 observations.
*
* Note that each level of `City' only makes sense within a
* particular state. This is in contrast with (for example)
* Reading method and Sex in an earlier example: The values of Sex
* for school children remain the same when a new reading method is
* tried.
* However, `Spokane' no longer makes sense when one
* leaves Washington State for New York or Ohio.
*
* In this case, City is NESTED within State, as opposed to being
* CROSSED with State. One might force City to be a CROSSED effect
* by only considering Cities that have the same name in different
* states. (Most states seem to have a city named Oxford, and many
* midwestern states have cities called Mexico or Peru.) However,
* there is no reason to associate different cities with the same
* name for the purpose of coffee tasting.
*
****************************************************************;

title 'COFFEE QUALITY in 12 cities in 3 states';
title2 'Which is significant: State, City within State, or Both?';
options ls=75 ps=60 pageno=1 nocenter;

****************************************************************;
* SAS NOTE: States are entered as `Name -1 -1 -1' below so that we
*   can read four values and then test for val1<0.
****************************************************************;

data coffee;
   retain State;
   length zz $15;    * Tell SAS to allow for long city names;
   input zz$ val1-val3 @@;
   if val1<0 then State=zz;
   else do;
      City=zz;
      array vv(*) val1-val3;
      do k=1 to 3;
         Qual=vv(k);
         output;
      end;
   end;
   drop zz k val1-val3;
datalines;
Ohio -1 -1 -1
Columbus 104 93 98      Cleveland 92 104 98
Cincinnati 102 99 100   Oxford 112 109 106
NewYork -1 -1 -1
Brooklyn 104 100 102    Buffalo 106 108 113
Albany 104 104 98       Oneonta 108 100 95
Washington -1 -1 -1
Bellingham 109 108 101  Seattle 114 111 109
Spokane 100 98 101      Yakima 97 104 109
;

proc print;
   title3 'THE DATA AS SAS SEES IT';
run;

****************************************************************;
* (Correct) analysis with City as a NESTED effect:
* The parentheses in the model statement mean `nested within'. Here
* `City(State)' means to sum `City' for fixed levels of `State'
* over the different levels of `State'.
* If `City' and `State' were crossed effects, then
*      City(State) = City + State*City;
****************************************************************;

proc glm;
   title3 'NESTED GLM ANALYSIS';
   classes State City;
   model Qual = State City(State);
run;

****************************************************************;
* In the nested analysis, City(State) was highly significant but
* State was not. The following step analyzes the Cities within each
* State, for 3 different one-way ANOVAs:
****************************************************************;

proc sort;
   by State City;
run;

proc glm;
   title3 'ANALYSES WITHIN STATES';
   by State;
   class City;
   model Qual = City;
   means City / duncan;
run;
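
****************************************************************;
* By the definition SSA(B) = Sum over b of SSA(b), the City sums of
* squares from the three within-State analyses should add up to
* SS City(State) in the nested analysis. The step below is one
* possible way to check this. It is a sketch added for illustration
* and is not part of the original program. It assumes a SAS release
* with ODS, and it assumes that the GLM output table ModelANOVA and
* its column SS are available under those names. The data set name
* CitySS is made up for this illustration. The ss1 option limits
* the table to Type I sums of squares, one row per State.
****************************************************************;

ods output ModelANOVA=CitySS;

proc glm data=coffee;
   title3 'CITY SS WITHIN EACH STATE (TYPE I ONLY)';
   by State;
   class City;
   model Qual = City / ss1;
run;
quit;

proc means data=CitySS sum;
   title3 'SUM OF WITHIN-STATE CITY SS, TO COMPARE WITH SS CITY(STATE)';
   var SS;
run;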