*******************************************************************; * Survival time in weeks after diagnosis of AML (a form of leukemia). * These data are the same as in `lr3vars.sas', except that we use a * Cox PH regression instead of AFT regressions. * * As before, we are interested in how survival time in weeks depends * on (i) Presence or absence of `Auer rods' in microscopic * examination of distorted white blood cells and * (ii) The Wbc count when AML was diagnosed. * * Since Wbc counts in affected individuals can be orders of magnitude * higher, we use logwbc=log10(Wbc) as a covariate instead of Wbc * itself. * * A complicating factor is that it appeared from the AFT regression * analysis that log(Wbc) had a negative effect on individuals with * Auer rods, but appeared to have no effect on individuals who had * no Auer rods (who were overall sicker). * * Note that the basic design is the same as that for an Analysis of * Covariance (ANCOVA) in linear models, in which a regression-like * continuous term is added onto a one-way ANOVA model. An ANCOVA * model has an INTERACTION if the regression slopes are not the * same for different treatment groups. This appears to be the case * here. (Here the `treatment groups' here are having/not having * Auer rods: This doesn't correspond to a `treatment' in any usual * sense, but `treatment group' is the standard term anyway.) * * In the AFT regression (lr3vars.sas), while regressions with just * Auer rod terms and regression slopes were highly significant, * the obvious analog of an ANCOVA model with interation had NO * COVARIATES significant. However, we were able to deduce what was * happening using a `stratified' model with simple regressions * on Logwbc within each treatment groups. * * From the output in `lr3vars.lst': * (i) Kaplan-Meier plots showed good separation between the two * groups, with the `NoRods' group having higher death rates, * (ii) the ls plot looked vaguely exponential, but the lls plot was * nonlinear, suggesting that the data wasn't a very good fit to * either a Weibull nor an exponential, * (iii) output from the generalized gamma model suggests that * (a) the data is closer to a log normal than to a Weibull, * (b) the best fit was not log-normal either, but * (c) it was not significantly different from a log normal. * * Fortunately, the statistical outcomes for the three models were * similar. However, we know from other examples that if the assumed * AFT error distribution is wrong, then coefficient estimates and * P-values can be wildly inaccurate. * * All of this suggests that we try a Cox regression, which makes * no assumptions about the parametric form of the survival * distribution. * * However, since the best-fitting parametric models do not have * proportional effects of covariates on hazards over time, we * should check the `proportion hazards' assumption. In other words, * we should check that the relative hazards due to lack of Auer * rods and the relative hazards due to high or low log(WBC) are * roughly constant over time. * * Data from Table 10.10, p271, of the 2nd edition of the text * (E.T.Lee, ``Statistical Methods for Survival Data Analysis'') * The 3rd edition discusses a somewhat similar example for ALL * (a different form of Leukemia) in Example 11.1, p266-269. * Note that this example has no censored values. * *******************************************************************; title 'COX REGRESSION FOR AUERRODS and LOG(WBC) -- YOURNAME'; options ls=75 ps=60 pageno=1 nocenter; title2 'SURVIVAL TIMES for initial white blood-cell count and Auer cell type'; *******************************************************************; * Since previous output suggested that NoRods is the state with the * higher death rate, let's define havenorods=1 if rodtype='NoRods' * so that NoRods will have a risk factor > 1. * (The only effect that this has on the model is that the sign of the * hazard coefficient of rodtype is reversed and the intercept term * may be different.) * * We use log10(WBC) (logs base 10) instead of log(WBC) (base e) to * make the logarithms more human-readable. *******************************************************************; data phwbc; retain rodtype havenorods; input zz$ wbc @@; if wbc=-1 then do; rodtype=zz; havenorods=(rodtype='NoRods'); end; else do; weeks=input(zz,12.0); Logwbc=log10(wbc); /* Define an ANCOVA-like interaction term to allow for */ /* different regression slopes in the two rod types */ if havenorods=1 then LogIntwbc=Logwbc; else LogIntwbc=-Logwbc; output; end; drop zz wbc; * For neatness; datalines; NoRods -1 56 4400 65 3000 17 4000 7 1500 16 9000 22 5300 3 10000 4 19000 2 27000 3 28000 8 31000 4 26000 3 21000 30 79000 4 100000 43 100000 WithRods -1 65 2300 156 750 100 4300 134 2600 16 6000 108 10500 121 10000 4 17000 39 5400 143 7000 56 9400 26 32000 22 35000 1 100000 1 100000 5 5200 65 100000 ; proc sort; * For clarity, sort by rodtype, then weeks, then Logwbc; by rodtype weeks Logwbc; run; proc print; title3 'THE DATA AS SAS SEES IT'; title4 'Note the different dependence on Logwbc in the two groups'; run; *******************************************************************; * Let's jump in by doing Cox PH regressions as in an ANCOVA model: * First main effects only, for weeks on rodtype and log(wbc). * Second, main effects and Interaction, for weeks on rodtype, log(wbc), * and a between-group correction for the regression on log(wbc). * As before, the regression on main effects is highly significant * with both main effects being significant. * The main-effect-only output show that, for each increase factor of * 10 in Wbc (i.e., LogWbc->Logwbc+1), the hazard rate goes up by a * factor of 2.13. Similarly, for any value of Wbc in this model, * the average relative risk of NoRods is 2.73. * Curiously, the model with interaction says that there is no * interaction, which we know to be false from before. * It also says that rodtype is not significant, but, since the * estimated regression slopes are different in the two groups, the * rodtype effect is an artifact of the difference between the * intercepts of the two slopes. **************************************************************; proc phreg data=phwbc; title3 'PHREG FOR WEEKS ON RODTYPE AND LOGWBC'; model weeks = havenorods Logwbc; run; proc phreg data=phwbc; title3 'PHREG FOR WEEKS ON RODTYPE, LOGWBC, AND INTERACTION'; model weeks = havenorods Logwbc LogIntwbc; run; *******************************************************************; * In constrast, let's do Cox regressions for `stratified' models * on rodtype, by either * (i) having completely separate slope parameters within each * treatment group, or by * (ii) doing a separate Cox regression within each treatment group. * * Consistent with what we knew from the AFT regressions, this shows * that initial WBC count (log10(WBC)) has a significant effect on * risk for individuals with Rods, but no significant effect on * those with NoRods. Among individuals with Auer rods, the increase * in estimated relative risk for each factor of 10 in Wbc (or for * log10(Wbc)->log10(Wbc)+1) is 3.672 and 3.446, respectively, for * the two models. The estimates are not completely identical due * to nonlinearities in the Cox log likelihood. **************************************************************; data phwbc; set phwbc; if havenorods=1 then do; logwbcwithrods=0; logwbcnorods=Logwbc; end; else do; logwbcwithrods=Logwbc; logwbcnorods=0; end; drop LogIntwbc; * Not needed any more; run; proc print; title3 'SEPARATE SLOPES FOR EACH TREATMENT: THE DATA AS SAS SEES IT'; run; proc phreg; title3 'COX PH REGRESSION with SEPARATE SLOPES for each rod type'; model weeks = havenorods logwbcwithrods logwbcnorods; run; proc phreg; title3 'STRATIFIED COX PH REGRESSION by rod type'; by rodtype; model weeks = Logwbc; run; **************************************************************; * Let's test the proportional hazards assumption. We do this by * including, for each covariate, a new covariate that is the old * covariate multiplied by an increasing function of time. If the * new covariate has a slope that is significantly positive or * significantly negative, then the effects of the old covariate * must have been either positively or negatively correlated with * time. This means that the PH assumption is false for that * covariate. This gives us an easy way to test the PH assumption * for each covariate. * * We do this for one covariate at a time, since too many covariates * in any regression may cause the model to lose power for all * covariates. * * IN GENERAL, for time effects in a Cox regression, there are * TWO POSSIBLE NOTIONS OF TIMES in the Cox likelihood: * * The first (which we call FactorTime) is the death time of the * individual or individuals that define that Cox likelihood factor, * which is the same as the death times in the numerator of the Cox * likelihood. Since all individuals in the risk set at that time * were assumed to be competing at that time, then FactorTime must * be used as the time in the exponential risk sum in the denominator * of that Cox likelihood factor. Once that is done, and if the MLE * estimated coefficient of FactorTime times a covariate is * significantly positive or negative, then that covariate had an * effect on the risk that was positively or negatively correlated * with clock time. Note that to make this conclusion, the same time * (FactorTime) must be used in both the numerator and the the risk * set in the denominator for each factor in the Cox likelihood, even * though most of the exponential terms in the denominator in the * risk set are for records for individuals whose death or censoring * times may be far in the future, the idea being that they are * competing NOW and not at their actual death or censoring times. * * The second time (which we call RecordTime) is the death or censoring * time considered as a covariate for each individual in the data set. * If this is in the model, this appears not only in the numerator * (where it would belong), but also as a part of the record covariates * in all the terms in the risk sum in the denominator. Since * RecordTime for the exponential terms in the risk sum in the * denominator will usually be larger than FactorTime, RecordTime * cannot be used as a proxy for a time-dependent covariate. * * The derivation of the Cox likelihood permits time-dependent * covariates in the sense of FactorTime, which must then appear * in all terms (numerator and denominator) of each Cox likelihood * factor. In fact, RecordTime as a covariate makes no sense as a * time-dependent covariate acting on risk sets, but only as an * initially-measured covariate for each individual that can somehow * foretell the future for that individual. * * Note that FactorTime is that it is NOT a function of the records * in the dataset, but of the order of the Cox likelihood factors in * which they are used. Thus we need an extension of `proc phreg' to * have time-dependent covariates depending on FactorTime. SAS allows * for this in a syntax that must be distinguished from the syntax * defining a new covariate depending on RecordTime. Using the new * syntax, SAS can allow for nearly any possible form for a * Cox-regression or logistic-regression likelihood with highly * modifiable behavior at each death (or failure) event. * * The first (correct) Cox regression below uses this syntax to define * a time-dependent covariate for Log(WBC) based on FactorTime, which * is the local clock for the factors in the Cox regression, as * opposed to the death or failure time of each exponential term in * the Cox likelihood, which is RecordTime. This is the correct use * of time in this syntax. The second (incorrect) Cox regression * attaches the time-dependent covariate to individual records, which * means that the times are RecordTime and not FactorTime, and will * give a very difficult likelihood than is intended. In fact, using * time-dependent covariates defined in a dataset before `proc phreg' * means that the resulting model, rather than being a model of * covariates with time-dependent effects, is actually a model with a * covariate that (in general) could only be known by a psychic. **************************************************************; proc phreg; title3 'CHECKING TIME-DEPENDENCE OF RELATIVE LOGWBC HAZARD'; title4 'USING THE CORRECT TIME (FACTORTIME)'; title5 'STRATIFIED COX REGRESSION FOR EACH ROD TYPE'; by rodtype; factortime = weeks*Logwbc; model weeks = Logwbc factortime; run; **************************************************************; * We now reopen the dataset `phwbc' to add a time-dependent variable that * is proportion to Logwbc. This will lead to the incorrect use of * `RecordTime' in the risk set sums in the Cox likelihood denominators. **************************************************************; data phbad; set phwbc; recordtime=weeks*Logwbc; run; proc phreg data=phbad; title4 "USING THE INCORRECT `RECORDIME' INSTEAD OF `FACTORTIME':"; title5 'THIS OUTPUT IS INCORRECT: NOTE THE DIFFERENCE IN OUTPUT!'; by rodtype; model weeks = Logwbc recordtime; run; **************************************************************; * Next we test time dependence of the NoAuerRod hazard. * This implies that NoRods has a risk that is relatively constant * over time. **************************************************************; proc phreg data=phwbc; title3 'CORRECT MODELING FOR CHECKING NO-ROD TIME-DEPENDENCE'; factortime = weeks*havenorods; model weeks = havenorods logwbcwithrods logwbcnorods factortime; run;