***************************************************************;
* CASE-CONTROL studies done by CONDITIONAL LOGISTIC REGRESSION
*
* Assume that we want to study the factors associated with a trait
* or condition or disease. We separate possible factors into two
* classes: factors that we think might be associated with the
* condition (family history, diet, smoking, heart rate, blood type,
* amount of exercise) and factors that may add variability to the
* data but we think are not likely to be directly associated
* with the condition (for example, age, sex, race, occupation,
* location, etc.). The second group is called ``nuisance factors''
* since they may add enough variability to the data that the
* factors of interest are not statistically significant, even
* though the nuisance factors themselves are not likely to be
* related to the condition.
*
* One approach to control for the effect of ``nuisance factors''
* is to select a sample of affected individuals that have the
* trait (if the condition is rare, this might be all known cases)
* and then, for each affected individual, choose one or more
* unaffected individuals that match the affected individual on
* the second set of factors (age, sex, race, occupation, etc.).
* This should reduce the noise caused by the unrelated factors.
* The affected individuals are called the `cases'. The matched
* unaffected individuals are the `controls'. For simplicity,
* assume that we have one `control' for each `case'.
*
* Since the dependent variable is a trait (i.e., yes or no), this
* suggests a logistic regression. However, we cannot use a standard
* logistic regression, since the sample was chosen by the values of
* the trait. More exactly, each case-control pair was chosen so
* that one individual is affected (Y=1) and one is unaffected (Y=0)
* based on matched traits, so that the case-control pairs cannot
* be assumed to be a random sample.
*
* One way to handle this is CONDITIONAL LOGISTIC REGRESSION.
* This is like a regular logistic regression, but is modified to
* handle this situation. Suppose that a case-control pair of
* individuals has (vectors of) covariates X0,X1. We model the
* conditional probability
*
*   L = Prob(Indiv. with X0 is affected, indiv. with X1 is not
*                | Exactly one indiv. in the pair is affected)
*
* We use the definition of conditional probability
*
*   P(A|B) = P(A and B)/P(B) = P(A)/P(B)
*
* for events A,B with A a subset of B. Suppose that the probability
* that an individual with covariates X is affected is
*
*   Prob(Y=1|X) = exp(a+bX)/(1 + exp(a+bX))
*
* Then, if a pair of individuals have covariates X0 and X1,
*
*   Prob(First affected, second not affected)
*
*     = exp(a+bX0)/(1+exp(a+bX0)) times 1/(1+exp(a+bX1))
*
*     = exp(a+bX0) / ((1+exp(a+bX0))(1+exp(a+bX1)))
*
* and
*
*   Prob(First not affected, second affected)
*
*     = (1/(1+exp(a+bX0))) times exp(a+bX1)/(1+exp(a+bX1))
*
*     = exp(a+bX1) / ((1+exp(a+bX0))(1+exp(a+bX1)))
*
* Note that the denominators are the same in both cases. Hence
*
*   Prob(Exactly one of the two is affected)
*
*     = (exp(a+bX0) + exp(a+bX1)) / ((1+exp(a+bX0))(1+exp(a+bX1)))
*
* and by the rule P(A|B) = P(A and B)/P(B)
*
*   L = Prob(Indiv. with X0 is affected | Exactly one is affected)
*
*     = exp(a+bX0)/(exp(a+bX0) + exp(a+bX1))
*
* Dividing the numerator and denominator by exp(a+bX1) cancels the
* intercept a and gives a logistic probability in the difference
* X0-X1:
*
*   L = exp(b(X0-X1))/(1 + exp(b(X0-X1)))
*
* Assume that we have a set of n matched case-control pairs with
* covariates (X0i,X1i).
* Under the condition that exactly one individual in each pair is
* affected, the probability that it is always the `case' individual
* is the likelihood
*
*   L = Prod_i L_i = Prod_i exp(b(X0i-X1i))/(1 + exp(b(X0i-X1i)))
*
* This is exactly the logistic regression likelihood for n (imaginary)
* individuals (instead of the 2n individuals that we have) such that
*
*   (i) the i-th individual has covariates X0i-X1i
*   (ii) all n imaginary individuals are affected (i.e., have Y=1)
*   (iii) the regression has no intercept term: That is, a=0
*
* Thus, if we can do a logistic regression with no intercept, then we
* can carry out a conditional logistic regression for matched
* case-control pairs.
*
* Note that this is similar in spirit to a paired-sample t-test for
* paired data (X_i,Y_i), such as measurements before and after
* a treatment for n individuals. This is often done as a
* one-sample t-test based on the differences Z_i=X_i-Y_i.
*
* We use these ideas to study possible predictive variables for
* adult-onset or Type 2 diabetes (AODM) using data for 30 matched
* pairs of individuals.
*
* Data from E.T.Lee & J.W.Wang, Statistical Methods for Survival
*   Data Analysis, 3rd edn, J.Wiley & Sons, 2003,
*   (Table 14.9, pp. 403-404)
*
* S. Sawyer - Washington University in St.Louis - November 9, 2005
***************************************************************;

title 'CASE-CONTROL LOGISTIC REGRESSION - Type 2 Diabetes - YOURNAME';
options ls=75 ps=60 pageno=1 nocenter;

data ccbmi;
  yy=1;   * All 30 imaginary individuals are affected ;
  * Each line has data for one `case' then for one `control' ;
  * `bmi' is `body mass index': An index of body fat ;
  * `fam' (1,0 for Yes,No) is family history of AODM;
  * `phac' (1,0 for Yes,No) is physically active, not sedentary;
  * `sed' (1,0 for Yes,No) =1-phac for sedentary, not phys.active;
  /* Reading the data: */
  input subj bmicase famcase phaccase bmictrl famctrl phacctrl;
  * Subtract data to get the covariates of the imaginary individual;
  /* Convert `physically active' to `sedentary' by changing the sign */
  bmi=bmicase-bmictrl;
  fam=famcase-famctrl;
  sed=-(phaccase-phacctrl);
  * Labels for SAS procedures that support labels for variables:;
  label bmi='Body Mass Index difference'
        fam='Family history difference of AODM (1=Yes,0=No)'
        sed='Sedentary (No physical activity) difference (1=Yes,0=No)';
datalines;
 1 22.1 1 1 26.7 0 1
 2 31.3 0 0 24.4 0 1
 3 33.8 1 0 29.4 0 0
 4 33.7 1 1 26.0 0 0
 5 23.1 1 1 24.2 1 0
 6 26.8 1 0 29.7 0 0
 7 32.3 1 0 30.2 0 1
 8 31.4 1 0 23.4 0 1
 9 37.6 1 0 42.4 0 0
10 32.4 1 0 25.8 0 0
11 29.1 0 1 39.8 0 1
12 28.6 0 1 31.6 0 0
13 35.9 0 0 21.8 1 1
14 30.4 0 0 24.2 0 1
15 39.8 0 0 27.8 1 1
16 43.3 1 0 37.5 1 1
17 32.5 0 0 27.9 1 1
18 28.7 0 1 25.3 1 0
19 30.3 0 0 31.3 0 1
20 32.5 1 0 34.5 1 1
21 32.5 1 0 25.4 0 1
22 21.6 1 1 27.0 1 1
23 24.4 0 1 31.1 0 0
24 46.7 1 0 27.3 0 1
25 28.6 1 1 24.0 0 0
26 29.7 0 0 33.5 0 0
27 29.6 0 1 20.7 0 0
28 22.8 0 0 29.2 1 1
29 34.8 1 0 30.0 0 1
30 37.3 1 0 26.5 0 0
run;

proc print;
  title2 'THE DATA AS SAS SEES IT';
  id subj;
run;

options ps=40;
proc chart;
  title2 'VISUALIZING THE DATA: VERTICAL BAR CHARTS:';
  title3 'POSITIVE VALUES ARE MORE COMMON AMONG THE AFFECTEDS';
  vbar bmi fam sed;
run;

options ps=60;
proc logistic;
  title2 'LOGISTIC REGRESSION WITH NO INTERCEPT';
  model yy = bmi fam sed / noint;
run;

***************************************************************;
* The model is (just barely) significant, but no single covariate
* is significant. Try again with a smaller model?
***************************************************************;

proc logistic;
  title2 'LOGISTIC REGRESSION WITH STEPWISE MODEL SELECTION';
  title3 'SLE=0.10 for entry, SLS=0.10 for removal';
  model yy = bmi fam sed / noint selection=stepwise sle=0.10 sls=0.10;
run;

***************************************************************;
* This selects BMI (only). Backwards stepwise selection also
* selects BMI (only). Let's try BMI by itself:
***************************************************************;

proc logistic;
  title2 'LOGISTIC REGRESSION FOR BMI ONLY WITH NO INTERCEPT';
  model yy = bmi / noint;
run;

***************************************************************;
* BMI seems to have borderline significance, but family history
* and possibly physical activity are suggestive.
*
* Try again with interactions?
***************************************************************;

data ccbmi;
  set ccbmi;
  bmifam=bmi*fam;
  bmised=bmi*sed;
  famsed=fam*sed;
run;

***************************************************************;
* Note that these terms are products of the within-pair
* differences. For example, bmised is +bmi if only the case is
* sedentary, -bmi if only the control is sedentary, and 0 if the
* pair agrees on physical activity. Similarly for bmifam and
* family history. An interaction defined for each individual
* (BMI among sedentary individuals only) would instead be the
* difference of the products, that is,
*   bmicase*(1-phaccase) - bmictrl*(1-phacctrl)
* Forwards stepwise selection again picks BMI (only), but
* backwards stepwise selection starting from six variables
* is more imaginative and picks FAM BMISED:
***************************************************************;

proc logistic;
  title2 'LOGISTIC REGRESSION WITH BACKWARDS STEPWISE REGRESSION';
  title3 'ADDING THREE INTERACTIONS';
  model yy = bmi fam sed bmifam bmised famsed
        / noint selection=backward details sle=0.10 sls=0.10;
run;

***************************************************************;
* Let's try regressions on FAM BMISED and BMISED only:
***************************************************************;

proc logistic;
  title2 'LOGISTIC REGRESSION FOR FAM BMISED WITH NO INTERCEPT';
  model yy = fam bmised / noint;
run;

proc logistic;
  title2 'LOGISTIC REGRESSION FOR BMISED ONLY WITH NO INTERCEPT';
  model yy = bmised / noint;
run;
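
***************************************************************;
* A possible cross-check (a sketch, not part of the original
* analysis): PROC LOGISTIC in more recent SAS releases has a
* STRATA statement that fits the conditional likelihood directly
* from one record per individual. Assuming that your SAS release
* supports STRATA, the estimates for bmi_ind, fam_ind and sed_ind
* below should agree with those for BMI, FAM and SED in the first
* no-intercept regression above, since for one control per case
* the two likelihoods are the same. The names cclong, affected,
* bmi_ind, fam_ind and sed_ind are made up for this illustration.
***************************************************************;

data cclong;
  set ccbmi;
  /* First record of each pair: the case (affected=1) */
  affected=1; bmi_ind=bmicase; fam_ind=famcase; sed_ind=1-phaccase;
  output;
  /* Second record of each pair: the matched control (affected=0) */
  affected=0; bmi_ind=bmictrl; fam_ind=famctrl; sed_ind=1-phacctrl;
  output;
  keep subj affected bmi_ind fam_ind sed_ind;
run;

proc logistic data=cclong;
  title2 'CROSS-CHECK: CONDITIONAL LOGISTIC REGRESSION WITH STRATA';
  /* Each matched pair (subj) is one stratum */
  strata subj;
  model affected(event='1') = bmi_ind fam_ind sed_ind;
run;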