*******************************************************************;
* Survival time in weeks after diagnosis of AML (a form of leukemia).
* These data are the same as in `lr3vars.sas', except that we use a
*   Cox PH regression instead of AFT regressions.
*
* As before, we are interested in how survival time in weeks depends
*   on  (i) Presence or absence of `Auer rods' in microscopic
*   examination of distorted white blood cells  and
*    (ii) The Wbc count when AML was diagnosed.
*
* Since Wbc counts in affected individuals can be orders of magnitude
*   higher, we use logwbc=log10(Wbc) as a covariate instead of Wbc
*   itself.
*
* A complicating factor is that it appeared from the AFT regression
*   analysis that log(Wbc) had a negative effect on individuals with
*   Auer rods, but appeared to have no effect on individuals who had
*   no Auer rods (who were overall sicker).
*
* Note that the basic design is the same as that for an Analysis of
*   Covariance (ANCOVA) in linear models, in which a regression-like
*   continuous term is added onto a one-way ANOVA model. An ANCOVA
*   model has an INTERACTION if the regression slopes are not the
*   same for different treatment groups. This appears to be the case
*   here. (Here the `treatment groups' here are having/not having
*   Auer rods: This doesn't correspond to a `treatment' in any usual
*   sense, but `treatment group' is the standard term anyway.)
*
* In the AFT regression (lr3vars.sas), while regressions with just
*   Auer rod terms and regression slopes were highly significant,
*   the obvious analog of an ANCOVA model with interation had NO
*   COVARIATES significant. However, we were able to deduce what was
*   happening using a `stratified' model with simple regressions
*   on Logwbc within each treatment groups.
*
* From the output in `lr3vars.lst':
*  (i) Kaplan-Meier plots showed good separation between the two
*    groups, with the `NoRods' group having higher death rates,
*  (ii) the ls plot looked vaguely exponential, but the lls plot was
*    nonlinear, suggesting that the data wasn't a very good fit to
*    either a Weibull nor an exponential,
*  (iii) output from the generalized gamma model suggests that
*    (a) the data is closer to a log normal than to a Weibull,
*    (b) the best fit was not log-normal either, but
*    (c) it was not significantly different from a log normal.
*
* Fortunately, the statistical outcomes for the three models were
*   similar. However, we know from other examples that if the assumed
*   AFT error distribution is wrong, then coefficient estimates and
*   P-values can be wildly inaccurate.
*
* All of this suggests that we try a Cox regression, which makes
*   no assumptions about the parametric form of the survival
*   distribution.
*
* However, since the best-fitting parametric models do not have
*   proportional effects of covariates on hazards over time, we
*   should check the `proportion hazards' assumption. In other words,
*   we should check that the relative hazards due to lack of Auer
*   rods and the relative hazards due to high or low log(WBC) are
*   roughly constant over time.
*
* Data from Table 10.10, p271, of the 2nd edition of the text
*  (E.T.Lee, ``Statistical Methods for Survival Data Analysis'')
*  The 3rd edition discusses a somewhat similar example for ALL
*  (a different form of Leukemia) in Example 11.1, p266-269.
*  Note that this example has no censored values.
*
*******************************************************************;


title 'COX REGRESSION FOR AUERRODS and LOG(WBC) -- YOURNAME';
options ls=75 ps=60 pageno=1 nocenter;

title2 'SURVIVAL TIMES for initial white blood-cell count and Auer cell type';

*******************************************************************;
* Since previous output suggested that NoRods is the state with the
*   higher death rate, let's define havenorods=1 if rodtype='NoRods'
*   so that NoRods will have a risk factor > 1.
* (The only effect that this has on the model is that the sign of the
*   hazard coefficient of rodtype is reversed and the intercept term
*   may be different.)
*
* We use log10(WBC) (logs base 10) instead of log(WBC) (base e) to
*   make the logarithms more human-readable.
*******************************************************************;

data phwbc;
     retain rodtype havenorods;
     input zz$ wbc @@;
     if wbc=-1 then do;
        rodtype=zz;  havenorods=(rodtype='NoRods');  end;
     else do;
        weeks=input(zz,12.0);  Logwbc=log10(wbc);
	/* Define an ANCOVA-like interaction term to allow for */
	/*   different regression slopes in the two rod types */
	if havenorods=1 then LogIntwbc=Logwbc;
	else LogIntwbc=-Logwbc;  output;  end;
     drop zz wbc;   * For neatness;
datalines;
  NoRods  -1
   56   4400      65   3000     17  4000
    7   1500      16   9000     22  5300
    3  10000       4  19000      2  27000
    3  28000       8  31000      4  26000
    3  21000      30  79000      4  100000
   43  100000
  WithRods  -1
   65   2300      156    750      100  4300
  134   2600       16   6000      108  10500
  121  10000        4  17000       39  5400
  143   7000       56   9400       26  32000
   22  35000        1  100000       1  100000
    5   5200       65  100000
  ;

proc sort;
     * For clarity, sort by rodtype, then weeks, then Logwbc;
     by rodtype weeks Logwbc;
     run;

proc print;
     title3 'THE DATA AS SAS SEES IT';
     title4 'Note the different dependence on Logwbc in the two groups'; 
     run;

*******************************************************************;
* Let's jump in by doing Cox PH regressions as in an ANCOVA model:
*   First main effects only, for weeks on rodtype and log(wbc).
*   Second, main effects and Interaction, for weeks on rodtype, log(wbc),
*     and a between-group correction for the regression on log(wbc).
* As before, the regression on main effects is highly significant
*   with both main effects being significant.
* The main-effect-only output show that, for each increase factor of
*   10 in Wbc (i.e., LogWbc->Logwbc+1), the hazard rate goes up by a
*   factor of 2.13. Similarly, for any value of Wbc in this model,
*   the average relative risk of NoRods is 2.73.
* Curiously, the model with interaction says that there is no
*   interaction, which we know to be false from before.
* It also says that rodtype is not significant, but, since the
*   estimated regression slopes are different in the two groups, the
*   rodtype effect is an artifact of the difference between the
*   intercepts of the two slopes.
**************************************************************;

proc phreg data=phwbc;
     title3 'PHREG FOR WEEKS ON RODTYPE AND LOGWBC';
     model weeks = havenorods Logwbc;
     run;

proc phreg data=phwbc;
     title3 'PHREG FOR WEEKS ON RODTYPE, LOGWBC, AND INTERACTION';
     model weeks = havenorods Logwbc LogIntwbc;
     run;


*******************************************************************;
* In constrast, let's do Cox regressions for `stratified' models
*   on rodtype, by either 
*   (i) having completely separate slope parameters within each
*     treatment group, or by 
*   (ii) doing a separate Cox regression within each treatment group.
*
* Consistent with what we knew from the AFT regressions, this shows
*   that initial WBC count (log10(WBC)) has a significant effect on
*   risk for individuals with Rods, but no significant effect on
*   those with NoRods. Among individuals with Auer rods, the increase
*   in estimated relative risk for each factor of 10 in Wbc (or for
*   log10(Wbc)->log10(Wbc)+1) is 3.672 and 3.446, respectively, for
*   the two models. The estimates are not completely identical due
*   to nonlinearities in the Cox log likelihood.
**************************************************************;

data phwbc;
     set phwbc;
     if havenorods=1 then 
       do;  logwbcwithrods=0;  logwbcnorods=Logwbc;  end;
     else do;  logwbcwithrods=Logwbc;  logwbcnorods=0;  end;
     drop LogIntwbc;  * Not needed any more;
     run;

proc print;
     title3 'SEPARATE SLOPES FOR EACH TREATMENT: THE DATA AS SAS SEES IT';
     run;

proc phreg;
     title3 'COX PH REGRESSION with SEPARATE SLOPES for each rod type';
     model weeks = havenorods logwbcwithrods logwbcnorods;
     run;

proc phreg;
     title3 'STRATIFIED COX PH REGRESSION by rod type';
     by rodtype;
     model weeks = Logwbc;
     run;


**************************************************************;
* Let's test the proportional hazards assumption. We do this by
*   including, for each covariate, a new covariate that is the old
*   covariate multiplied by an increasing function of time. If the
*   new covariate has a slope that is significantly positive or
*   significantly negative, then the effects of the old covariate
*   must have been either positively or negatively correlated with
*   time. This means that the PH assumption is false for that
*   covariate. This gives us an easy way to test the PH assumption
*   for each covariate.
*
* We do this for one covariate at a time, since too many covariates
*   in any regression may cause the model to lose power for all
*   covariates.
*
* IN GENERAL, for time effects in a Cox regression, there are
*   TWO POSSIBLE NOTIONS OF TIMES in the Cox likelihood:
*
* The first (which we call FactorTime) is the death time of the
*   individual or individuals that define that Cox likelihood factor,
*   which is the same as the death times in the numerator of the Cox
*   likelihood. Since all individuals in the risk set at that time
*   were assumed to be competing at that time, then FactorTime must
*   be used as the time in the exponential risk sum in the denominator
*   of that Cox likelihood factor. Once that is done, and if the MLE
*   estimated coefficient of FactorTime times a covariate is
*   significantly positive or negative, then that covariate had an
*   effect on the risk that was positively or negatively correlated
*   with clock time. Note that to make this conclusion, the same time
*   (FactorTime) must be used in both the numerator and the the risk
*   set in the denominator for each factor in the Cox likelihood, even
*   though most of the exponential terms in the denominator in the
*   risk set are for records for individuals whose death or censoring
*   times may be far in the future, the idea being that they are
*   competing NOW and not at their actual death or censoring times.
*
* The second time (which we call RecordTime) is the death or censoring
*   time considered as a covariate for each individual in the data set.
*   If this is in the model, this appears not only in the numerator
*   (where it would belong), but also as a part of the record covariates
*   in all the terms in the risk sum in the denominator. Since
*   RecordTime for the exponential terms in the risk sum in the
*   denominator will usually be larger than FactorTime, RecordTime
*   cannot be used as a proxy for a time-dependent covariate.
*
* The derivation of the Cox likelihood permits time-dependent
*   covariates in the sense of FactorTime, which must then appear
*   in all terms (numerator and denominator) of each Cox likelihood
*   factor. In fact, RecordTime as a covariate makes no sense as a
*   time-dependent covariate acting on risk sets, but only as an
*   initially-measured covariate for each individual that can somehow
*   foretell the future for that individual.
*
* Note that FactorTime is that it is NOT a function of the records
*   in the dataset, but of the order of the Cox likelihood factors in 
*   which they are used. Thus we need an extension of `proc phreg' to
*   have time-dependent covariates depending on FactorTime. SAS allows
*   for this in a syntax that must be distinguished from the syntax
*   defining a new covariate depending on RecordTime. Using the new
*   syntax, SAS can allow for nearly any possible form for a
*   Cox-regression or logistic-regression likelihood with highly
*   modifiable behavior at each death (or failure) event.
*
* The first (correct) Cox regression below uses this syntax to define
*   a time-dependent covariate for Log(WBC) based on FactorTime, which
*   is the local clock for the factors in the Cox regression, as
*   opposed to the death or failure time of each exponential term in
*   the Cox likelihood, which is RecordTime.  This is the correct use
*   of time in this syntax. The second (incorrect) Cox regression
*   attaches the time-dependent covariate to individual records, which
*   means that the times are RecordTime and not FactorTime, and will
*   give a very difficult likelihood than is intended. In fact, using
*   time-dependent covariates defined in a dataset before `proc phreg'
*   means that the resulting model, rather than being a model of
*   covariates with time-dependent effects, is actually a model with a
*   covariate that (in general) could only be known by a psychic.
**************************************************************;

proc phreg;
     title3 'CHECKING TIME-DEPENDENCE OF RELATIVE LOGWBC HAZARD';
     title4 'USING THE CORRECT TIME (FACTORTIME)';
     title5 'STRATIFIED COX REGRESSION FOR EACH ROD TYPE';
     by rodtype;
     factortime = weeks*Logwbc;
     model weeks = Logwbc factortime;
     run;

**************************************************************;
* We now reopen the dataset `phwbc' to add a time-dependent variable that
*   is proportion to Logwbc. This will lead to the incorrect use of
*   `RecordTime' in the risk set sums in the Cox likelihood denominators.
**************************************************************;

data phbad;
     set phwbc;
     recordtime=weeks*Logwbc;
     run;

proc phreg data=phbad;
     title4 "USING THE INCORRECT `RECORDIME' INSTEAD OF `FACTORTIME':";
     title5 'THIS OUTPUT IS INCORRECT: NOTE THE DIFFERENCE IN OUTPUT!';
     by rodtype; 
     model weeks = Logwbc recordtime;
     run;

**************************************************************;
* Next we test time dependence of the NoAuerRod hazard.
* This implies that NoRods has a risk that is relatively constant
*   over time.
**************************************************************;

proc phreg data=phwbc;
     title3 'CORRECT MODELING FOR CHECKING NO-ROD TIME-DEPENDENCE';
     factortime = weeks*havenorods;
     model weeks = havenorods logwbcwithrods logwbcnorods factortime;
     run;