********************************************************;
* Women's National Track Records by Country
*   Source: IAAF/ATFS Track and Field Statistics Handbook
*           for the 1984 Los Angeles Olympics
*   Best times for 7 different track events.
*
* Are there simple common factors? What do the important
*   principal components look like?
*
* Data is in the accompanying file `WomensTrack.dat'
*
* Notes on the SAS program:
* (1) The `infile' statement below says to read data from a text
*     file instead of (for example) a datalines block.
*     WARNING: If the log file says that SAS cannot find the input
*     data text file, then you may need to include an explicit path
*     to the file. (See an example below for Windows PCs.)
* (2) `firstobs=8' tells SAS to start reading the text file at
*     line 8. The first 7 lines of the text file are COMMENTS and
*     SHOULD NOT BE READ as input data.
* (3) `length country $12' warns SAS that the names in this field
*     may be up to 12 characters. By default, SAS truncates text
*     fields at 8 characters.
*
* Data from Johnson & Wichern, ``Applied Multivariate Statistical
*   Analysis'', 5th ed, 2002, Table 1.9, p46.
********************************************************;

title "WOMEN'S NATIONAL TRACK RECORDS BY COUNTRY - YOUR NAME";
title2 ' SEVEN TRACK EVENTS';
options nodate ls=75 ps=62 pageno=1 nocenter;

data womenstrack;
  infile "WomensTrack.dat" firstobs=8;
  * infile "c:\sasprogs\math475\WomensTrack.dat" firstobs=8;
  length country $12;
  input country$ m100 m200 m400 m800 m1500 m3000 Marathon;
run;

proc print data=womenstrack;
  title3 "Data from the IAAF/ATFS Track and Field Statistics handbook";
  title4 " for the 1984 Los Angeles Olympics:";
run;

options ps=60;   * To be safe;

********************************************************;
* The first three principal components are easy to interpret, but
*   only the first PC (PRIN1) is obviously significant.
* Hint: PRIN3 separates countries that are relatively good (or bad)
*   in middle-distance events.
* A variety of PCA output is written to `statout', which will be
*   used to generate the scree plot.
********************************************************;

proc princomp data=womenstrack out=prindat outstat=statout;
  title3 "PRINCIPAL COMPONENTS ANALYSIS (CORRELATION-BASED)";
  var m100--Marathon;
run;

********************************************************;
* The scatterplot of PRIN2*PRIN1 can be used to identify countries
*   by how their national records compare.
* Each point is marked with the first character of the country name.
* If you used `proc gplot' instead of `proc plot', you could put
*   the country names by each point.
********************************************************;

options ps=30;
proc plot data=prindat;
  title4 "CORRELATION DATA: PRIN2*PRIN1";
  plot prin2*prin1=country;
run;
options ps=60;

********************************************************;
* List countries sorted by PRIN1
********************************************************;

proc sort data=prindat out=sortone;
  by prin1;
run;

proc print data=sortone;
  title4 "ASCENDING SORT BY PRIN1";
  var country prin1 prin2 m100--Marathon;
run;

********************************************************;
* Generate the scree plot from SAS output
********************************************************;

title3 'GENERATING A SCREE PLOT FROM PROC PRINCOMP OUTPUT';

proc print data=statout;
  title4 "PROC PRINCOMP DATA WRITTEN TO `OUTSTAT'";
run;

data eigenrow;
  set statout;
  if _TYPE_ eq "EIGENVAL";
run;

proc print data=eigenrow;
  title4 "THE EIGENVALUES AS A ROW VECTOR";
run;

********************************************************;
* SAS procedure to write a new SAS dataset that is the transpose
*   (matrix) of another SAS dataset.
* Variable names in the original dataset become the first column in
*   the new dataset, but need a SAS variable name (name=...)
* The first column in the original dataset becomes the SAS variable
*   names in the new dataset.
********************************************************;

proc transpose data=eigenrow out=screedata name=event;
  id _TYPE_;
run;

data screedata;
  set screedata;
  xx+1;   * Sum statement: like xx=xx+1 with xx retained, starting from xx=0;
run;

proc print data=screedata;
  title4 "TRANSPOSED EIGENVALUE VECTOR AND DATA FOR SCREE PLOT";
run;

options ps=30;
proc plot data=screedata;
  title4 "SCREE PLOT";
  plot EIGENVAL*xx = xx;
run;
options ps=60;

********************************************************;
* In contrast, let's see what happens if we do PCA using the
*   covariance matrix instead of the correlation matrix:
********************************************************;

proc princomp cov data=womenstrack;
  title3 "PRINCIPAL COMPONENTS USING THE COVARIANCE";
  title4 "NOTE THAT OUTPUT IS DOMINATED BY MARATHON TIMES";
  var m100--Marathon;
run;
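
********************************************************;
* SKETCH (an addition, not part of the original program):
*   The scree plot above shows the raw eigenvalues. As a possible
*   companion, this step turns them into the proportion and the
*   cumulative proportion of total variance for each component.
* Assumptions: the dataset screedata built above still holds the
*   correlation-based eigenvalues in the variable EIGENVAL, with
*   the component index in xx. The names eigtotal, screepct,
*   total, proportion and cumulative are introduced here only
*   for illustration.
********************************************************;

title3 "PROPORTION OF VARIANCE EXPLAINED (CORRELATION-BASED PCA)";

* Sum the eigenvalues, which gives the total variance;
proc means data=screedata noprint;
  var EIGENVAL;
  output out=eigtotal sum=total;
run;

* Attach the total to every row and accumulate the proportions;
data screepct;
  if _n_ = 1 then set eigtotal(keep=total);
  set screedata;
  proportion = EIGENVAL / total;
  cumulative + proportion;   * Sum statement, as with xx above;
run;

proc print data=screepct;
  title4 "EIGENVALUES WITH PROPORTION AND CUMULATIVE PROPORTION";
  var event xx EIGENVAL proportion cumulative;
  format proportion cumulative 6.4;
run;

********************************************************;
* These figures should agree with the eigenvalue table that
*   proc princomp already prints. Having them in a dataset is
*   only a convenience if you want to plot or reuse them.
********************************************************;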