Molecules as Documents of Evolutionary History

By Emile Zuckerkandl and Linus Pauling

Gates and Crellin Laboratories of Chemistry, California Institute of Technology
Contribution No. 3041

1. The chemical basis for a molecular phylogeny.

Of all natural systems, living matter is the one which, in the face of great transformations, preserves inscribed in its organization the largest amount of its own past history. Using Hegel's expression, we may say that there is no other system that is better "aufgehoben" (constantly abolished and simultaneously preserved). We may ask where in the now living systems the greatest amount of their past history has survived and how it can be extracted.

At any level of integration, the amount of history preserved will be the greater, the greater the complexity of the elements at that level and the smaller the parts of the elements that have to be affected to bring about a significant change. Under favorable conditions of this kind, a recognition of many differences between two elements does not preclude the recognition of their similarity.

One may classify molecules that occur in living matter into three categories according to the degree to which the specific information contained in an organism is reflected in them:

(1) Semantophoretic molecules or semantides -- molecules that carry the information of the genes or a transcript thereof. The genes themselves are the primary semantides (linear "sense-carrying" units). Messenger-RNA molecules are secondary semantides. Polypeptides, at least most of them, are tertiary semantides.

(2) Episemantic molecules -- molecules that are synthesized under the control of tertiary semantides. All molecules built by enzymes in the absence of a template are in this class. They are called episemantic because, although they do not express extensively the information contained in the semantides, they are a product of this information.

(3) Asemantic molecules -- molecules that are not produced by the organism and therefore do not express, either directly or indirectly (except by their presence, to the extent that this presence reveals a specific mechanism of absorption), any of the information that this organism contains. However, the organism may often use them, and may often modify them anabolically and thus change them into episemantic molecules to the extent of this modification.

The same molecular species may be episemantic in one organism and asemantic in another. Vitamins constitute examples. Simple molecules such as phosphate ion and oxygen also fall into this category. Macromolecules found in an organism for any length of time are never asemantic, viruses excepted. Viruses and other "episomes" (Wollman and Jacob, 1959) are asemantic when present in the host cell in the vegetative, autonomous state; they are semantophoretic when integrated into the genome of the host.

Products of catabolism are not included in this classification. During the enzymatic breakdown of molecules, information contained in enzymes is expressed, but instead of being manifested in both the reaction and the product, this information is manifested in the reaction only. Since we are considering products, catabolites as such are non-existent with respect to the proposed classification.

The relevance of molecules to evolutionary history decreases as one passes from semantides to asemantic molecules, although the latter may represent quantitative or qualitative characteristics of groups. As such they are, however, unreliable and uninformative. It is plain that asemantic molecules are not worthy of consideration in inquiries about phylogenetic relationships. Neither can episemantic molecules furnish the basis for a universal phylogeny, for such molecules, if universal, are not variable (ATP), and, if variable, are not universal (starches).
It appears however possible a priori that parts of the phylogenetic tree could be defined in terms of episemantic molecules. An attempt in this direction has been made for instance on the basis of carotenoids in different groups of bacteria (cf. Goodwin, 1962). It is characteristic of such studies that they need independent confirmation. Such independent confirmation may be obtained by direct or indirect studies of semantides.

In relation to a number of organic molecules, such as vitamin B12, organisms as far apart on the evolutionary scale as bacteria, flagellates, and higher vertebrates differ, not in that the compound is present or absent, required or not required, but in the prevalent "pattern of specificity" (Hutner, 1955). By this is meant the measure of functional effectiveness of compounds closely similar to but not identical with the one that is actually present. Thereby the difference of organisms in relation to the organic molecules under consideration is reduced to differences in enzymes and, in the last analysis, to the difference in primary structure of polypeptide chains. Because of this relationship to studies of semantides, it is possible that the establishment of different patterns of specificity is one of the best uses to which episemantic and asemantic molecules may be put in phylogenetic studies.

Whereas semantides are of three types only (DNA, RNA, polypeptides), episemantic molecules are of a great variety of types. Their interest for phylogeny is proportional to their degree of complexity. Polysaccharides such as cellulose are large molecules, but their complexity is small because of the monotonous repeat of the same subunits. In fact, not only episemantic molecules, but also semantides vary in their degree of complexity. The complexity of semantides is largest in the case of large globular polypeptide chains and smallest in the case of structural proteins characterized by numerous repeats of simple sequences.
There may be a region of overlap of semantides with the lowest degree of complexity and of episemantic molecules with the highest degree of complexity. The former, however, will still contain more information than the latter about the present and the past of the organism. Indeed, episemantic molecules are mostly polygenic characters, in that enzymes controlled by several distinct structural genes have to collaborate in their synthesis; moreover, they express the information contained in the active centers of enzymes only, and in no other enzymatic region; and even then they express this information ambiguously, i.e., probably with considerable "degeneracy". There is thus a great loss of information as one passes from semantides to episemantic molecules.

Incidentally, one cannot yet be sure that all polypeptides are semantides. Some, especially among the small ones, but also among the large structural ones, may be episemantic. Thus, it has not been possible to split glutenin into subunits of reproducible molecular weight (Taylor and Cluskey, 1962). This raises the suspicion that glutenin might not be produced as the transcript of a template.

Because tertiary semantides (enzymes) with different primary structures can lead to the synthesis of identical episemantic molecules as long as the active enzymatic sites are similar, wrong inferences about phylogenetic relationships may be drawn from the presence of identical or similar episemantic molecules in different organisms. Amylopectins in plant starches and animal glycogens are very similar, yet we may expect (a point it will be of interest to verify) that the amino-acid sequence of the enzymes responsible for the synthesis of these polysaccharides in animal and plant kingdoms is very different.
Moreover a similar end-product, in the case of episemantic molecules, may be obtained by different pathways, so that not even the active sites of the enzymes involved need to be similar. The synthesis of nicotinic acid and that of tyrosine are carried out via different pathways in bacteria and in other organisms (cf. Cohen, 1963). Therefore, the presence of these molecules in no way points to a phylogenetic relationship between bacteria and these other organisms. The number of possible historical backgrounds to the presence of a molecule synthesized by an organism will tend toward unity only as the number of enzymes involved in the synthesis of this molecule increases significantly. It is not likely that a whole pyramid of enzymatic actions has been built more than once or twice during evolution. This consideration implies that the best phylogenetic characters among episemantic molecules are not just the most complex molecules but, among these, the ones that are built from the least complex asemantic molecules.

The preceding discussion suggests that the most rational, universal, and informative molecular phylogeny will be built on semantophoretic molecules alone. Evolution, in these molecules, seems to proceed most frequently by the substitution of one single building stone out of, say, 50 to 300 for polypeptides or, on the basis of a triplet code, 150 to 900 for the corresponding nucleic acids. Even these small changes can have profound consequences at higher levels of organic integration, through an alteration of the established pattern of molecular interaction. Therefore, in macromolecules of these types there is more history in the making and more history preserved than at any other single level of biological integration.

In previous communications (Zuckerkandl and Pauling, 1962; Pauling and Zuckerkandl, 1963) we have discussed ways of gaining information about evolutionary history through the comparison of homologous polypeptide chains. This information is threefold: (1) the approximate time of existence of a molecular ancestor common to the chains that are being compared; (2) the probable amino-acid sequence of this ancestral chain; and (3) the lines of descent along which given changes in amino-acid sequence occurred. The first type of information is obtained in part through an assessment of the overall differences between homologous polypeptide chains. The second and third types of information are obtained through a comparison of individual amino-acid residues as found at homologous molecular sites. Our purpose was to spell out principles of how to extract evolutionary history from molecules, rather than to write any part thereof in its final form -- an attempt that would require more information than is presently available even in the case of hemoglobins.

Beside the analysis of amino-acid sequence of a greater number of homologous polypeptide chains, two other sources of knowledge will help in retracing the evolutionary history of molecules. One is a consideration of the genetic code, to assess whether the passage of one character of sequence to another could have occurred in one step, to discover possible intermediary states of sequence, and to evaluate, to a better approximation than by the simple comparison of the amino-acid sequence of two homologous polypeptide chains, the minimum number of mutational events that separate these two chains on the evolutionary scale.
A second is the study of three-dimensional molecular models, to permit one to make predictions about the effects of particular substitutions and, on the basis of the transitions allowed by the genetic code, to exclude some substituents, as incompatible with the preservation of molecular function, from the list of possible evolutionary intermediates.

This cursory outline of methodology in chemical paleogenetics applies directly to the analysis of polypeptide chains only. Although techniques are not yet available for a thorough investigation of sequence in other types of semantides, it is of interest to examine the relationship between the different types of semantides with respect to the information they contain.

2. Cryptic genetic polymorphism through isosemantic substitution.

For any one corresponding set of molecules, the three scripts used by Nature in semantophoretic molecules, the DNA-, RNA- and polypeptide scripts, represent largely, but presumably not exactly, the same message. Errors of transcription (Pauling, 1957) are presumably only a minor cause of this lack of congruence. In view of the "degeneracy" of the genetic code, many amino acids appearing to be coded for by more than one type of codon (Weisblum et al., 1962; Jones and Nirenberg, 1962), one must assume that information is lost in the passage from secondary (RNA) to tertiary (polypeptide) semantides. Moreover many primary semantides may not be transcribed: there are significant stretches of DNA that are apparently not expressed in polypeptide products, and the base sequence along these stretches may represent important documents about the history of the organism as well as its present organization and potentialities.
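The loss of information in the passage from RNA to polypeptide can be made concrete with a minimal sketch. It assumes a small subset of the modern standard genetic code, which postdates this paper and is not one of the proposals discussed here; the names `CODON_TABLE`, `translate` and `isosemantic`, and the two example messages, are hypothetical and serve only as illustration.

```python
# Assumption: a subset of the modern standard genetic code, used only to
# illustrate degeneracy. It is not any of the 1962-1963 proposals cited here.
CODON_TABLE = {
    "GAA": "Glu", "GAG": "Glu",
    "GUU": "Val", "GUC": "Val", "GUA": "Val", "GUG": "Val",
    "AAA": "Lys", "AAG": "Lys",
}


def translate(rna):
    """Translate an RNA string, codon by codon, into a list of residues."""
    if len(rna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    return [CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna), 3)]


def isosemantic(rna_a, rna_b):
    """True if two different base sequences specify the same polypeptide."""
    return rna_a != rna_b and translate(rna_a) == translate(rna_b)


# Two hypothetical messages that differ in the third base of every codon.
message_1 = "GUUGAAAAA"
message_2 = "GUGGAGAAG"

if __name__ == "__main__":
    print(translate(message_1), translate(message_2))
    print("isosemantic:", isosemantic(message_1, message_2))
```

Both messages yield the same three residues, so the difference between them cannot be recovered from the polypeptide alone; it is visible only at the level of the nucleic acid.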
The degeneracy of the genetic code, then, leads one to predict the existence of isosemantic heterozygosity, namely of differences in base sequence in allelic stretches of DNA that do not lead to differences in amino-acid sequence in the corresponding polypeptide chains. The base sequence of a codon may be changed, but the "sense" of the "word", in terms of amino acids, may remain the same. The same inference has been drawn independently by Richard T. Jones (personal communication and reference).

Eck (1963) proposes that one of the three letters of each codon, perhaps the middle letter, is recognized by transfer-RNA's only as "purine" or "pyrimidine", i.e., according to the bulk of the molecule rather than to its exact species. Thus shifts between adenine and guanine or between cytosine and uracil in the middle letter of messenger-RNA codons will not be heeded by transfer-RNA. If this is so, one must distinguish two levels of crypticity. Some base substitutions will remain cryptic, unexpressed at the level of the polypeptide chains, but will be recognized at the level of transfer-RNA (secondary crypticity). Other base substitutions will remain cryptic at the level of both the polypeptide chain and the transfer-RNA (primary crypticity). (A third, more superficial level of crypticity was often referred to in the past, namely cryptic amino-acid substitutions in polypeptides, substitutions that were supposed to actually exist, but not to be detected by available chemical means.) According to Eck's code, primary crypticity should exist for every amino acid, because of the peculiar role tentatively attributed to the middle letter of the codon, and secondary crypticity should exist for eleven amino acids out of twenty. Amino acids that occur with high frequencies usually seem to have degenerate codes.
The opportunities for isosemantic substitution and cryptic genetic polymorphism, even of the secondary type of crypticity, should therefore be very widespread indeed.

As is well known, the abnormal human hemoglobins HbS and HbC differ from HbA in that a valyl residue replaces a glutamyl residue in the β-chain of HbS at the sixth position from the amino-end, whereas a lysyl residue replaces the same glutamyl residue in HbC (Ingram, 1961; Hunt and Ingram, 1958, 1959). According to the proposals for the genetic code made by Jukes (1962), by Wahba et al. (1963) and by Eck (1963), the shift from valine to lysine is one of the rare ones that require three base-pair substitutions in DNA and therefore, presumably, three mutational steps. If correct, this conclusion would render unlikely the hypothesis, previously formulated by one of us (Pauling, 1961), that HbC is derived from HbS rather than from HbA. The three genetic codes, on the other hand, are compatible with a one-step transition between HbA and HbS as well as between HbA and HbC.

According to Eck's code, though not according to the other proposals for a genetic code mentioned above, the valine of HbS and the lysine of HbC must however have derived from two distinct isosemantic codons for glutamic acid in HbA (Fig. 1). Whether or not Eck's code will, in the end, be shown to be correct in this respect, we may accept it provisionally for the sake of this discussion. Indeed, even if HbS and HbC are not the products of mutations in two isosemantic codons, other cases of this type are likely to be found in the future. If the situation is as represented in Fig. 1, the two isosemantic codons for glutamic acid (which are actually, according to Eck, resolvable into four isosemantic triplets, AAG, AGG, UAG and UGG) must be thought to have at one time been widespread in the human population, and may even today constitute a case of perhaps widely occurring cryptic genetic polymorphism.
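The codon relationships just described can be checked by counting base differences, as in the following minimal sketch. It uses only the triplet patterns quoted in this paper for Eck's proposed code (P = purine, Y = pyrimidine): APG and UPG for glutamic acid, UYG for valine. The lysine patterns APA and APU are an assumption, read from the partly illegible Fig. 1. The function names are hypothetical, and every base substitution is counted as one mutational step.

```python
from itertools import product

PURINES = "AG"       # "P" in the notation of Fig. 1
PYRIMIDINES = "UC"   # "Y" in the notation of Fig. 1

# Triplet patterns quoted in the text for Eck's (1963) proposed code.
ECK_PATTERNS = {
    "glu": ["APG", "UPG"],
    "val": ["UYG"],
    "lys": ["APA", "APU"],  # assumption: as read from Fig. 1
}


def expand(pattern):
    """Expand a pattern such as 'APG' into its explicit triplets."""
    pools = []
    for letter in pattern:
        if letter == "P":
            pools.append(PURINES)
        elif letter == "Y":
            pools.append(PYRIMIDINES)
        else:
            pools.append(letter)
    return ["".join(bases) for bases in product(*pools)]


def triplets(amino_acid):
    """All explicit triplets listed for an amino acid in ECK_PATTERNS."""
    return [t for p in ECK_PATTERNS[amino_acid] for t in expand(p)]


def base_differences(a, b):
    """Number of positions at which two triplets differ."""
    return sum(x != y for x, y in zip(a, b))


def min_steps(aa_from, aa_to):
    """Minimum number of base substitutions separating two amino acids."""
    return min(base_differences(a, b)
               for a in triplets(aa_from) for b in triplets(aa_to))


def one_step_patterns(aa_from, aa_to):
    """Patterns of aa_from that lie one substitution away from aa_to."""
    targets = triplets(aa_to)
    return [p for p in ECK_PATTERNS[aa_from]
            if any(base_differences(a, b) == 1
                   for a in expand(p) for b in targets)]


if __name__ == "__main__":
    for pair in [("glu", "val"), ("glu", "lys"), ("val", "lys")]:
        print(pair, min_steps(*pair))
    print("glu patterns one step from val:", one_step_patterns("glu", "val"))
    print("glu patterns one step from lys:", one_step_patterns("glu", "lys"))
```

Under these assumptions the count gives one step from glutamic acid to either valine or lysine and three steps from valine to lysine, and the one-step precursors of valine and of lysine are different glutamic-acid codons (UPG and APG respectively), in agreement with the argument of the text.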
A search for such polymorphism might be made in particular among individuals who appear to possess different isoalleles of HbA, in the nomenclature of Itano (1957), namely different heritable relative levels of HbA production. Because all transfer-RNA's that would correspond to all possibilities of a degenerate code might conceivably not be available in excess in certain organisms or in certain tissues of an organism, isosemantic substitutions may lead to increased or decreased rates of polypeptide synthesis. Thus there could exist an operator-independent change in rate of protein synthesis through base-pair substitutions in structural genes. Pauling ( ) and Itano (1957) used to think that the existence of isoalleles is probably linked to cryptic substitutions. In terms of cryptic amino-acid substitutions, this hypothesis has ceased to be as likely as it appeared to be, at least as a very generally applicable explanation. Yet it may be reintroduced on the basis of cryptic base substitutions in DNA and RNA.

If the scarceness of some species of isosemantic transfer-RNA's can affect the rate of synthesis of polypeptides, the synthesis of a single polypeptide chain should not proceed at constant speed along the chain, but so to speak in jerks, with sudden decelerations at molecular sites where the codons happen not to correspond to species of transfer-RNA that are present in excess. No evidence for or against this effect is available to our knowledge.

One may ask whether cryptic isosemantic substitutions may offer a possible alternate explanation, beside those already proposed, of the thalassemic inhibition of certain hemoglobin chain genes. A single isosemantic substitution may form a bottle-neck in synthesis, but an additive effect of such substitutions is also possible. The human δ-chain may be one that is universally "thalassemic" in this sense. This interpretation of thalassemia or of the low rate of synthesis of δ-chains would imply that normal amounts of the corresponding messenger-RNA's are produced and occupy a significant percentage of the available ribosomal sites without leading to much synthesis. Such an effect is not very likely, especially in thalassemia. Indeed, an extensive survey of the literature has shown in heterozygotes for α-thalassemia or β-thalassemia a "compensatory" increase of mean absolute amounts per cell of the chain synthesized under the control of the allelic gene, whether this allelic gene be normal or structurally abnormal (unpublished). This observation suggests that more ribosomal sites have become available to messenger-RNA produced by a single allele than is the case when both alleles are normally active. If correct, this interpretation would imply that the output of messenger-RNA by thalassemic hemoglobin genes is actually reduced, and that the block of polypeptide synthesis in thalassemia is at the genic level rather than at the level that involves the action of transfer-RNA. On the other hand one may surmise that the low rate of synthesis of δ-chains is correlated with a low genic output of messenger-RNA more probably than with low synthetic efficiency at the ribosomal level.

Moreover, the hypothetical effect on rate of synthesis of isosemantic substitutions is not supported by Eck's code for hemoglobin S, if it is assumed that HbS has arisen from HbA. According to Eck, there is indeed only one codon for valine recognizable by transfer-RNA, namely UYG (i.e., UCG and UUG; Y stands for "pyrimidine"). One can therefore not say, without resorting to an auxiliary hypothesis, that the apparent slower relative rate of HbS synthesis as compared to HbA synthesis in HbA/HbS heterozygotes is perhaps due to the appearance of a codon whose corresponding transfer-RNA is present in limiting amounts.
The auxiliary hypothesis is that a given kind of transfer-RNA is not entirely indifferent to the exact chemical species of the central purine or pyrimidine in a codon. It is possible that the exact chemical species of the middle "letter", while without action on the coding, influences the rate of synthesis of the polypeptide. However, very recent evidence (Levere and Lichtman, 1963) suggests that the rate of synthesis of HbS may in reality not be inferior to that of HbA. The present status of these problems is one of uncertainty.

Although there is therefore no evidence in favor of considering isosemantic substitutions as a significant factor in the regulation of the rate of polypeptide synthesis, the possibility is not ruled out and should be kept in mind as furnishing a basis for Itano's idea, expressed here in a slightly modified way, that rate of synthesis and structure, at the level of a given structural gene, are intimately linked (Itano, 1957).

If isosemantic substitutions recognized by transfer-RNA actually exerted an effect on rate of polypeptide synthesis, one would expect natural selection to act quite strongly on such substitutions. If natural selection did not act on the other postulated type of isosemantic substitutions, those of "primary crypticity", not recognized by transfer-RNA, the occurrence of such substitutions would be random. It would be more probable that some effect is present and that natural selection acts here also. A possible effect of the exact chemical species of the presumed middle letter of a codon on rate of synthesis has just been mentioned. Other effects might be considered. In particular, it has been shown that the frequency of crossing over is inversely related to the degree of heterozygosity in a chromosome pair (Stadler and Towe, 1962). Via this mechanism isosemantic substitutions of primary crypticity as well as those of secondary crypticity may have far-reaching effects on population genetics.
One may also examine the possibility that isosemantic substitutions have some effect on evolutionary stability. Benzer (1961) has pointed out that, since AT base pairs are held together much less strongly than GC base pairs, a genetic region rich in AT pairs will tend to be more subject to substitution. By selecting for isosemantic triplets rich in GC and low in AT content, an organism might reduce its mutation rate without changing the structure of any of its proteins. Whether such an effect would be significant enough to influence the rate of evolution remains to be seen. It will be of interest to compare the base composition of DNA from "living fossils" such as Lingula or Limulus with base composition in more rapidly evolving animals.

Finally, isosemantic substitutions in those regions of DNA that carry out the function of operators (Jacob and Monod, 1961) might well lead to a modification of the stereochemical relationship between the operators and the repressor molecules. Thereby such isosemantic substitutions would have an effect on rate of polypeptide synthesis, distinct from the operator-independent effect discussed earlier.

Due to isosemantic substitutions, there probably is more evolutionary history inscribed in the base sequence of nucleic acids than in the amino-acid sequence of corresponding polypeptide chains. By its implications, a degenerate code thus emphasizes the role of nucleic acids as "master molecules" over polypeptides (a role still doubted by some (Commoner, 1962)), even though polypeptides may interact with nucleic acids to regulate the rate of synthesis of both polypeptides and nucleic acids. All the potentialities of an individual may be assumed to be inscribed in polypeptide chains that are actually synthesized, or could be synthesized, by the cells under certain circumstances, and in the structures that control the actual and potential rates of this synthesis.
Yet it appears conceivable, since equal rates of synthesis under the control of distinct but isosemantic codons are possible, that the individual contains information, not only, as we know, beyond that which it actually uses for its realization, but even beyond that which defines its potentialities. This part of its "being", necessarily cryptic in terms of the phenotype, would at best be expressed only in relation to the evolution of the species.

REFERENCES

Benzer, S., Proc. Natl. Acad. Sci. 47, 403-415, 1961.
Cohen, S. S., Science 139, 1017-1026, 1963.
Commoner, B., in: "Horizons in Biochemistry", M. Kasha and B. Pullman, eds., Academic Press, New York, 1962, pp. 319-334.
Eck, R. V., Science 140, 477-481, 1963.
Goodwin, T. W., in: "Comparative Biochemistry", M. Florkin and H. S. Mason, eds., Vol. IV, Academic Press, 1962, pp. 643-675.
Hunt, J. A. and Ingram, V. M., Nature 181, 1062-1063, 1958.
Hunt, J. A. and Ingram, V. M., Nature 184, 870-872, 1959.
Hutner, S. H., in: "Biochemistry and Physiology of Protozoa", S. H. Hutner and A. Lwoff, eds., Vol. II, Academic Press, New York, 1955.
Ingram, V. M., Nature 189, 704-708, 1961.
Itano, H. A., Advances in Protein Chem. 12, 215-268, 1957.
Jacob, F. and Monod, J., Cold Spring Harbor Symposia Quant. Biol. 26, 193-211, 1961.
Jones, O. W. and Nirenberg, M. W., Proc. Natl. Acad. Sci. 48, 2115-2123, 1962.
Jones, R. T., in: "Symposium on Foods: Proteins and their Reactions", H. W. Schultz and A. F. Anglemier, eds., The Avi Publishing Co., in press.
Jukes, T. H., Proc. Natl. Acad. Sci. 48, 1809-1815, 1962.
Levere, R. D. and Lichtman, H. C., Blood 22, 334-341, 1963.
Pauling, L., in: "Arbeiten aus dem Gebiet der Naturstoffchemie. Festschrift Arthur Stoll", Birkhäuser, Basel, 1957, pp. 597-602.
Pauling, L., Rudolf Virchow Memorial Lecture, New York, 1961; Proceedings of the Rudolf Virchow Medical Society (S. Karger AG, Basel).
Pauling, L. and Zuckerkandl, E., Acta Chem. Scand., in press.
Stadler, D. R. and Towe, A. M., Genetics 47, 839-846, 1962.
Taylor, N. W. and Cluskey, J. E., Arch. Biochem. Biophys. 97, 399-405, 1962.
Wahba, A. J., Gardner, R. S., Basilio, C., Miller, R. S., Speyer, J. F. and Lengyel, P., Proc. Natl. Acad. Sci. 49, 116-122, 1963.
Weisblum, B., Benzer, S. and Holley, R. W., Proc. Natl. Acad. Sci. 48, 1449-1454, 1962.
Wollman, E. L. and Jacob, F., "La sexualité des bactéries", Masson et Cie, Paris, 1959.
Zuckerkandl, E. and Pauling, L., in: "Horizons in Biochemistry", M. Kasha and B. Pullman, eds., Academic Press, New York, 1962, pp. 189-225.

Acknowledgment. One of the authors (E.Z.) is greatly indebted to Professor Joshua Lederberg for discussions about the topics treated in this paper.

Fig. 1. The relation between coding triplets in hemoglobins A, S and C according to Eck's code.

HbS   val   UYG
HbA   glu   APG . UPG
HbC   lys   APA . APU

Possible one-step transitions are marked by double arrows. Middle letters of codons: P = purine, Y = pyrimidine. Other symbols as usual (cf. Eck, 1963).