How can one tell in which direction evolution is going?

Stanley Sawyer

Virtually all biologists believe that, in the long run, the most important changes in organisms are due to the replacement of genes by new genes that do a better job for the organism.

However, many biologists believe that in large, established populations most evolutionary change is instead due to the replacement of genes by slightly deleterious variants. The reason is that most mutations are harmful rather than helpful, and mildly harmful mutations can become established in a population (replacing the former, better variant) through the chance effects of who mates with whom and who happens to survive. This process would take a long time in a large population, but most evolutionary change takes place on a long time scale. These chance effects could not establish a severely damaged gene that was important to its host, but they could replace a good gene with one that was only slightly worse.
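The contrast between slightly and severely deleterious variants can be made concrete with a small sketch. It uses Kimura's diffusion approximation for the fixation probability of a single new mutant, which is a standard textbook formula and is not stated in this essay. The function name, the population size of 1000, and the selection coefficients are illustrative assumptions.

```python
import math

def fixation_probability(s, N):
    """Approximate chance that one new mutant copy eventually replaces the
    existing variant in a diploid population of N individuals.

    Assumptions (not from the essay): Kimura's diffusion approximation,
    effective population size equal to N, and a selection coefficient s
    that acts additively (s < 0 means deleterious).
    """
    if s == 0:
        return 1.0 / (2 * N)          # pure chance: one copy out of 2N
    exponent = -4.0 * N * s
    if exponent > 700:                # exp() would overflow; result is ~0
        return 0.0
    # (1 - e^{-2s}) / (1 - e^{-4Ns}), written with expm1 for accuracy
    return math.expm1(-2.0 * s) / math.expm1(exponent)

N = 1000                              # hypothetical population size
neutral = fixation_probability(0.0, N)
for label, s in [("slightly worse", -1e-4), ("much worse", -1e-2), ("better", 1e-2)]:
    p = fixation_probability(s, N)
    print(f"{label:15s} s={s:+.4f}  P(fix)={p:.3e}  relative to neutral={p / neutral:.3g}")
```

With these made-up values, the slightly worse variant becomes established at a rate comparable to a neutral one, while the much worse variant essentially never does.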

In this view, most evolution in large populations is downhill. Any improvement in the population is due to one of two processes. In (i), very rare mutations to significantly better genes arise and spread through the population very quickly, after which the population begins to move downhill again from a higher plateau. In (ii), the entire large population is replaced by the descendants of an isolated small population. In this scenario, several new favorable mutations, or a whole family of favorable mutations, become established in the small population due to inbreeding and the chance effects of mating and survival in a small population. The prevailing biological thinking since the 1920s has been that (ii) is more likely than (i), at least for major changes.

An argument for either (i) or (ii) is that, in the fossil record, creatures appear not to change for long periods and then suddenly a noticeably different creature appears. The new creature is presumably doing a better job in the same habitat than the creature it replaced, but might conceivably be no better than the first creature was before it started going downhill. For shorter time periods (millions of years rather than tens or hundreds of millions of years), there is not enough fossil evidence to tell whether evolutionary change is continuous or comes in bursts.

The most reliable and easiest to analyze biological information is from DNA. Unfortunately, DNA older than around 10,000 years is very rare and its use is still controversial. Thus we are led to try to answer questions about historical trends on the basis of the distribution of DNA in contemporary populations. This can be done: The distribution of a set of mutations within a population is different if the mutations are advantageous, deleterious, or selectively the same as the original variants. For example, advantageous or deleterious mutant genes that are present in a sample will be less common in the sample than if they had no significantly different effect on their host. This is because it is more difficult for deleterious genes to become common, and advantageous genes will tend to become established or nearly established as soon as they become common. There are also subtle differences in the distributions of advantageous as opposed to deleterious mutant changes. One can also use the number of established differences between two related species to gain additional information.

The basic data consist of a sample of DNA sequences from one gene in one species (say, m sequences) and n DNA sequences from the same gene in a closely related species. The data come from two species of the fruit fly Drosophila and from two species of a common weed, Arabidopsis.
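As a rough illustration of how such data are summarized, the sketch below takes m aligned sequences from one species and n from the other and counts two kinds of sites: those that vary within a sample (polymorphic) and those where the two species differ consistently (established, or fixed, differences). The function name and the tiny sequences are invented for illustration; real analyses work on full gene alignments and usually also separate DNA changes by type.

```python
def classify_sites(species_a, species_b):
    """Count polymorphic sites and fixed differences in two aligned samples.

    species_a, species_b: lists of equal-length DNA strings (toy input).
    Returns (polymorphic, fixed).
    """
    length = len(species_a[0])
    if any(len(seq) != length for seq in species_a + species_b):
        raise ValueError("all sequences must be aligned to the same length")

    polymorphic = 0
    fixed = 0
    for i in range(length):
        bases_a = {seq[i] for seq in species_a}
        bases_b = {seq[i] for seq in species_b}
        if len(bases_a) > 1 or len(bases_b) > 1:
            polymorphic += 1          # varies within at least one sample
        elif bases_a != bases_b:
            fixed += 1                # uniform in each sample, but different
    return polymorphic, fixed

# Made-up example: m = 3 sequences from one species, n = 2 from the other.
sample_a = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]
sample_b = ["ACCTATGT", "ACCTACGT"]
print(classify_sites(sample_a, sample_b))
```

These two counts, collected across many genes, are the kind of summary a statistical model of mutation and selection can then be fitted to.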

One also needs a statistical model for the sample frequencies of DNA changes as a function of mutation rates and amounts of selection. One then applies the statistical model to the sequence data and estimates parameters along with measures of statistical confidence of the parameter estimates.

Unfortunately, even a large number of sequences from a single genetic locus or type of gene does not provide sufficient statistical power, so one needs sequence data from the two species at many different genetic loci or types of genes (for example, 34 loci or 54 loci). Classical statistical methods ("maximum likelihood") are not well behaved for this much data of this type. A newer statistical method called Markov chain Monte Carlo (MCMC) is effective and does produce results. Its disadvantage is that MCMC methods can take many hours on a fast computer, as opposed to milliseconds for classical methods, but the classical methods do not work in this case.
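For readers unfamiliar with MCMC, here is a minimal sketch of the Metropolis algorithm, one of the simplest MCMC methods. It is not the sampler used in the papers below. The toy problem (estimating a single Poisson rate from a handful of counts, with an Exponential(1) prior) was chosen only because the exact answer is known, and every name and number in it is an assumption made for illustration.

```python
import math
import random

def log_posterior(rate, counts):
    """Log posterior, up to a constant, for a Poisson rate with an
    Exponential(1) prior (both are toy assumptions)."""
    if rate <= 0:
        return -math.inf
    return sum(counts) * math.log(rate) - (len(counts) + 1) * rate

def metropolis(counts, n_steps=20000, step=0.6, seed=1):
    """Random-walk Metropolis sampler for the rate parameter."""
    rng = random.Random(seed)
    rate = 1.0                                  # arbitrary starting value
    current = log_posterior(rate, counts)
    samples = []
    for _ in range(n_steps):
        proposal = rate + rng.gauss(0.0, step)  # symmetric proposal
        candidate = log_posterior(proposal, counts)
        # accept with probability min(1, posterior ratio)
        if rng.random() < math.exp(min(0.0, candidate - current)):
            rate, current = proposal, candidate
        samples.append(rate)
    return samples[n_steps // 10:]              # drop the first 10% as burn-in

counts = [3, 5, 4, 6, 2, 4, 5, 3]               # made-up data
samples = metropolis(counts)
estimate = sum(samples) / len(samples)
exact = (sum(counts) + 1) / (len(counts) + 1)   # known posterior mean here
print(f"MCMC estimate {estimate:.3f}, exact posterior mean {exact:.3f}")
```

The chain only ever needs the ratio of posterior values at two parameter settings, which is why the approach still works when the model is too complicated for classical methods. The price is the long run of repeated evaluations.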

Some references are

1. Sawyer, S. A. and D. L. Hartl (1992) Population genetics of polymorphism and divergence. Genetics 132, 1161--1176.

(This derives the basic statistical model. Generally speaking, biology journals do not like mathematical derivations, but this one allowed us to put a mathematical proof in an Appendix.)

2. Hartl, D. L., E. N. Moriyama, and S. A. Sawyer (1994) Selection intensity for codon bias. Genetics 138, 227--234.

(This applies the same theory to a slightly different problem, namely the tendency for different DNA variants to show the effects of selection even though they produce exactly the same gene product.)

3. Bustamante, C. D., R. Nielsen, S. A. Sawyer, K. M. Olsen, M. D. Purugganan, and D. L. Hartl (2002) The cost of inbreeding in Arabidopsis. Nature 416, 531--534.

(This applies the MCMC theory to a simple model of selection in which all new mutations of genes of a particular type (for example, for a particular enzyme) are either (i) immediately lethal or nearly lethal, and so can be ignored, or else (ii) have exactly the same selective advantage or disadvantage. This model is not realistic, but the single estimated selection coefficient for a particular gene might be an average selection coefficient of some kind for that gene. The conclusion was that two Drosophila species appeared to be positively evolving but that two weedy species (Arabidopsis) were going downhill.)

4. Sawyer, S. A., R. J. Kulathinal, C. D. Bustamante, and D. L. Hartl (2003) Bayesian analysis suggests that most amino acid replacements in Drosophila are driven by positive selection. Journal of Molecular Evolution 57, S154--S164.

(This paper generalizes the model in the previous reference so that, as the mutations occur, the selective advantages of nonlethal mutations are normally distributed with a mean that depends on the particular enzyme. The variance of the normal distributions is assumed to be the same for all loci. The model was applied to a subset of the Drosophila data.

One of the conclusions was that while only about 20% of new, nonlethal mutations were beneficial, 48% of mutations that were polymorphic in the sample were beneficial, and 94% of mutations that became fixed in the entire population were beneficial. This suggests that the more pessimistic view of the downhill evolution of large populations is incorrect, at least for Drosophila.)
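The way the beneficial fraction rises from new mutations to fixed ones can be illustrated with a small simulation. This is only a sketch and does not reproduce the analysis in the paper. It assumes selection coefficients (scaled by population size, written gamma here) drawn from a normal distribution with a hypothetical mean and standard deviation chosen so that about 20% of new mutations are beneficial, and it weights each mutation by the standard diffusion-approximation fixation rate relative to a neutral mutation, 2*gamma / (1 - exp(-2*gamma)). The polymorphic class is left out because it would also require a model of sample frequencies.

```python
import math
import random

def relative_fixation_rate(gamma):
    """Fixation rate relative to a neutral mutation under the standard
    diffusion approximation; gamma is the scaled selection coefficient."""
    if abs(gamma) < 1e-8:
        return 1.0                    # neutral limit
    if gamma < -300:
        return 0.0                    # avoids overflow; rate is ~0 anyway
    return 2.0 * gamma / -math.expm1(-2.0 * gamma)

def beneficial_fractions(mean, sd, n_draws=100000, seed=1):
    """Return (fraction beneficial among new mutations,
               fraction beneficial among fixed mutations)."""
    rng = random.Random(seed)
    new_beneficial = 0
    fixed_beneficial = 0.0
    fixed_total = 0.0
    for _ in range(n_draws):
        gamma = rng.gauss(mean, sd)
        weight = relative_fixation_rate(gamma)
        fixed_total += weight
        if gamma > 0:
            new_beneficial += 1
            fixed_beneficial += weight
    return new_beneficial / n_draws, fixed_beneficial / fixed_total

# Hypothetical parameters, not estimates from any of the papers above.
new_frac, fixed_frac = beneficial_fractions(mean=-4.2, sd=5.0)
print(f"beneficial among new mutations:   {new_frac:.2f}")
print(f"beneficial among fixed mutations: {fixed_frac:.2f}")
```

Even when most new mutations are harmful, the mutations that actually become fixed can be overwhelmingly beneficial, because harmful mutations are so much less likely to spread through a large population.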

Last modified June 23, 2004