The Journal of Chemical Physics, 08 January 2005
J. Chem. Phys. 122, 024901 (2005) (18 pages)
©2005 American Institute of Physics. All rights reserved.

How effective for fold recognition is a potential of mean force that includes relative orientations between contacting residues in proteins?

Sanzo Miyazawa^a)

Faculty of Technology, Gunma University, Kiryu, Gunma 376-8515, Japan
Laurence H. Baker Center for Bioinformatics and Biological Statistics, Plant Sciences Institute, Iowa State University, Ames, Iowa 50011-3020

Robert L. Jernigan^b)

Laurence H. Baker Center for Bioinformatics and Biological Statistics, Plant Sciences Institute, Iowa State University, Ames, Iowa 50011-3020
Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, Iowa 50011-3020

(Received: 16 August 2004; accepted: 1 October 2004; published online: 16 December 2004)

We estimate the statistical distribution of relative orientations between contacting residues from a database of protein structures and evaluate the potential of mean force for relative orientations between contacting residues. Polar angles and Euler angles are used to specify two degrees of directional freedom and three degrees of rotational freedom, respectively, for the orientation of one residue relative to another in contacting residues. A local coordinate system affixed to each residue based only on main chain atoms is defined for fold recognition. The number of contacting residue pairs in the database will severely limit the resolution of the statistical distribution of relative orientations, if it is estimated by dividing space into cells and counting samples observed in each cell. To overcome such problems and to evaluate the fully anisotropic distributions of relative orientations as a function of polar and Euler angles, we choose a method in which the observed distribution is represented as a sum of δ functions, each of which represents the observed orientation of a contacting residue, and is evaluated as a series expansion of spherical harmonics functions. The sample size limits the frequencies of modes whose expansion coefficients can be reliably estimated. High frequency modes are statistically less reliable than low frequency modes. Each expansion coefficient is separately corrected for the sample size according to suggestions from a Bayesian statistical analysis. As a result, many expansion terms can be utilized to evaluate orientational distributions. Also, unlike other orientational potentials, the uniform distribution is used for a reference distribution in evaluating a potential of mean force for each type of contacting residue pair from its orientational distribution, so that residue-residue orientations can be fully evaluated.
It is shown by using decoy sets that the discrimination power of the orientational potential in fold recognition increases by taking account of the Euler angle dependencies and becomes comparable to that of a simple contact potential, and that the total energy potential taken as a simple sum of contact, orientation, and (φ, ψ) potentials performs well to identify the native folds. © 2005 American Institute of Physics.

I. INTRODUCTION

For the past ten years, there have been many attempts¹–³⁵ to develop coarse-grained scoring potentials that can identify native structures from non-native folds.³⁶,³⁷,³⁸,³⁹ These simplified potentials are useful in studies of protein structural prediction⁴⁰,⁴¹,⁴²,⁴³ and protein dynamics and folding mechanism²⁸,²⁹,⁴⁴ because it is computationally difficult to use all-atom molecular dynamics simulations for these purposes.

The idea of using residue-residue contact frequencies to represent contact preferences between amino acids was proposed first by Tanaka and Scheraga,¹ and a contact potential²,³,⁴ for each type of amino acid pair at a residue level was evaluated in the Bethe approximation under the assumption that protein structures can be regarded as a mixture of disconnected residues in statistical equilibrium. Sippl⁸ introduced a distance dependency into a pair potential and evaluated it as a potential of mean force. Score functions at an atomic level were also devised.¹¹,¹²,¹³,¹⁴,¹⁸ The capabilities of pairwise score functions to identify native structures from non-native folds have been examined by optimizing them,¹⁹,²⁰,²¹,²²,²³,²⁴,²⁵ and it was reported that it is impossible for a pairwise potential²¹ or even a distance-dependent potential²³,²⁴ to identify all native structures. Multibody potentials have also been derived and the importance of multibody interactions has been pointed out.²⁸,²⁹,³⁰,³¹ Liwo et al.³² developed a general method to derive multibody terms in a potential of mean force.

On the other hand, the importance of specific coordinations between residues in protein structures was pointed out by Bahar and Jernigan.⁴⁵ Liwo et al.¹⁵,¹⁶ developed a united-residue force field that is both radial and anisotropic. The united-residue force field was determined by parameterizing physically reasonable functional forms of potentials of mean force for side chain interactions. Each side chain was represented by an ellipsoid and the relative orientation between side chains was described by three angles. The interactions between side chains were parameterized as van der Waals potentials. Buchete et al.³⁴,³⁵ also attempted to develop anisotropic statistical potentials from the observed distribution of relative residue-residue orientations in known protein structures. To represent the orientation of one residue relative to another, three degrees of translational freedom and three degrees of rotational freedom must be specified. A polar coordinate system and Euler angles can be used to specify the three degrees of translational freedom and the three degrees of rotational freedom, respectively. In their potentials, only radial distance and polar angle dependencies of relative residue-residue orientations were taken into account; Euler angle dependencies of the orientations were not explicitly taken into account, probably because of the limited size of samples. Onizuka et al.³³ attempted to estimate a fully anisotropic distance-dependent potential, which is a function of radial distance, polar angles, and also Euler angles, for each type of residue pair, although they could not achieve any improvement in the discrimination power of their score function by taking account of Euler angle dependencies. These analyses indicate the importance of residue-residue orientations in residue-residue interactions.

Here the fully anisotropic distributions of relative orientations between contacting residues are estimated as a function of polar and Euler angles from known protein structures. Those Euler angle dependencies and correlations between polar and Euler angles are analyzed as well as polar angle dependencies.

For evaluation of the frequency distribution of residue-residue orientations, we did not use a method of dividing space into many cells and counting samples observed in each cell, but instead employed the method proposed by Onizuka et al.³³ in which the observed distribution of residue-residue orientations is represented as a sum of δ functions, each of which represents the observed location in angular space, and then is estimated in the form of a series expansion with spherical harmonics functions, ignoring high frequency modes that occur because of the limited sample size. High frequency modes are statistically less reliable than low frequency modes. Here, unlike other works,³³,³⁴,³⁵ each expansion term is separately corrected for the sample size according to suggestions from an analysis of Bayesian statistics. As a result, many expansion terms can be utilized to evaluate orientational distributions. A local coordinate system for each residue is defined for fold recognition, based only on main chain atoms, to represent directional and rotational relationships between the main chains of contacting residues rather than between the side chains.³³,³⁴,³⁵ Results show that a large contribution to the orientational entropy of residue pairs comes from the Euler angle dependencies of the frequency distribution and also from the polar and Euler angle correlations. Then, an energy potential for relative orientations of contacting residues is evaluated for each type of amino acid pair as a potential of mean force from the estimated distributions.

A reference state is also defined differently from other works.³³,³⁴,³⁵ A reference distribution for each type of amino acid pair is the uniform distribution rather than the overall distribution for all types of amino acid pairs employed by other works,³³,³⁴,³⁵ so that residue-residue orientations can be fully evaluated. The overall distribution may be one of the important characteristics to distinguish proteinlike structures from others, because the overall distribution observed in native structures is not known to be characteristic of non-native conformations. The zero energy level of the orientational potential for each residue pair type is defined such that the expected value of orientational energy for the native folds is equal to zero for each type of contacting residue pair. Therefore, this orientational potential represents simply the suitability of a given relative orientation between contacting residues. Also, this orientational potential can be used without any modification as a scoring function for optimum sequence designs and sequence-structure alignments in which deletions and additions of amino acids are allowed.⁷

It is shown that the discrimination performance of the orientational potential in fold recognition is significantly improved by taking account of Euler angle dependencies, and that the performance of a total energy potential consisting of a long-range contact potential and a short-range secondary structure potential is improved by taking account of the orientational potential as an additional term.

II. METHODS

A. Coarse-grained conformational energy

A conformational potential, which will be used for fold recognition, is represented as the sum of coarse-grained long-range E^l and short-range E^s potentials. The long-range potential has two terms, a contact energy E^c reflecting contact frequencies in crystal structures and a repulsive energy E^r to penalize overly dense packing.

The short-range potential is a secondary structure potential based on peptide dihedral angles. All of these potentials are estimated as potentials of mean force from the observed distributions of residue-residue contacts and of peptide dihedral angles at the residue level in crystal structures of proteins. In the following, energy is represented in k_BT units, where k_B is the Boltzmann constant and T is temperature.

B. Contact potentials

The total contact energy is defined here as the sum of all pairwise energies between residues,

where e^c(r_i,r_j) is the contact energy between the ith and jth residues, and r_i represents all the atomic positions of the ith residue. The pairwise energy potential is represented as the sum of two terms, one of which is the usual contact potential²,³,⁴ and the other is a potential of mean force for relative orientations between contacting residues that is evaluated here from the statistical distribution of relative orientations,

where Δ^c(r_i,r_j) represents the degree of contact between the ith and jth residues, e^c_{a_ia_j} is the contact energy for residues of types a_i and a_j in contact, and e^o_{a_ia_j}(r_i,r_j) is the orientational energy for the relative direction and rotation between amino acids of types a_i and a_j in contact; a_i means the amino acid type of the ith residue. Here, it should be noted that the radial distance between residues is described by specifying whether or not these residues are in contact with each other, and that orientational interactions are assumed only for residues that are in contact with each other.

Δ^c(r_i,r_j) takes the value one for residues that are completely in contact, the value zero for residues that are too far from each other, and values between one and zero for residues whose distance is intermediate between those two extremes, about 6.5 Å between geometric centers of their side chain heavy atoms. Previously, this function was defined as a step function for simplicity. Here, it is defined as a switching function as follows; in the equation below to define residue contacts, r_i means the position vector of the geometric center of side chain heavy atoms, or of the C^α atom for GLY,

where S_w is a switching function, and r^{vdw}_a is the van der Waals radius of a residue of type a, which is estimated from the average volume V_a occupied by a residue of type a in protein structures with the packing density of hard spheres ρ; the values of V_a are those calculated in Refs. 46 and 47 and listed in Ref. 2. The critical distance to define a residue-residue contact is about 6.5 Å, but it is taken to be larger for bulky residues.
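The degree of contact described above can be sketched as follows. The functional form of S_w is not reproduced in this text, so the sine-based switch and the 0.5 Å half-width used here are illustrative assumptions only, as are the function names; the 6.5 Å critical distance is the value quoted above, without the enlargement applied to bulky residues.

```python
import math

def switching(x, half_width=0.5):
    """Hypothetical smooth switch: 1 well below zero, 0 well above zero."""
    if x <= -half_width:
        return 1.0
    if x >= half_width:
        return 0.0
    # Smooth interpolation between the two plateaus (assumed form).
    return 0.5 * (1.0 - math.sin(0.5 * math.pi * x / half_width))

def contact_degree(r_i, r_j, d_contact=6.5, half_width=0.5):
    """Degree of contact between two side chain centers (coordinates in angstroms)."""
    distance = math.dist(r_i, r_j)
    return switching(distance - d_contact, half_width)
```

With these assumptions, a pair of centers 5 Å apart is fully in contact, a pair 8 Å apart is not in contact, and a pair exactly at the critical distance has a contact degree of one half.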

Pairwise contact energies are defined as the sum of a collapse energy e^c_{rr} and a residue-type dependent term Δe^c_{aa′}; r means an average residue here.

The energies Δe^c_{aa′} for all pairs of the 20 types of residues were recalculated⁴⁴ from 2129 protein species representatives of the SCOP⁴⁸ Release 1.53 with the sampling method³ and with the parameters evaluated in Miyazawa and Jernigan⁴ to correct these values estimated by the Bethe approximation; actually, the estimates of contact energies corrected for the Bethe approximation are divided by the factor α (0.263) defined in Eq. (34) of that paper⁴ and used as the values of Δe^c_{aa′}. In other words, the intrinsic pairwise interaction energies δe_ij are corrected relative to the hydrophobic energies Δe_ir, and the hydrophobic energies are not corrected at all; see that paper⁴ for the exact definitions of δe_ij and Δe_ir. This scheme is employed so that all the energy potentials in Eq. (1) have magnitudes estimated as the potential of mean force from observed distributions by assuming a Boltzmann distribution.

The collapse energy e^c_{rr} is essential for a protein to fold by canceling out the large conformational entropy of extended conformations, but it is difficult to estimate.²,³ The value –2.55 originally estimated²,³ for e^c_{rr} is used here; as a result, the contact energy e^c_{aa′} takes a negative value for all amino acid pairs except for the LYS-LYS pair.

C. Residue-residue orientational potential

In the representation of the relative location of one residue with respect to another, three degrees of translational freedom and three degrees of rotational freedom must be specified. Here, distances between residues are described by specifying whether or not those residues are in contact with each other. Thus, for contacting residue pairs, two degrees of directional freedom and three degrees of rotational freedom are needed to represent those relative locations. Let us use polar angles (θ, φ) and Euler angles (Θ, Φ, Ψ) to describe the direction and rotation of one residue relative to another, respectively. A local coordinate system fixed on each residue will be defined later. The potential of mean force for residue orientations is defined as

where f_{aa′}(θ, φ, Θ, Φ, Ψ) is a probability density function for a residue of type a′ at the orientation (θ, φ, Θ, Φ, Ψ) relative to the residue of type a; it satisfies

An obvious relationship between the Euler angles exists for the distribution of residue orientations between f_{aa′} and f_{a′a}:

The relationship in respect to the polar angles (θ′, φ′) is not simple, but (θ′, φ′) can be uniquely calculated from (θ, φ, Θ, Φ, Ψ). Thus, in principle, f_{aa′} and f_{a′a} must be equal to each other:

However, in the present statistical estimation of the probability density, the relationship above would be only approximately satisfied. Therefore, the potential is evaluated in the form of Eq. (11).

The second and the fourth terms in Eq. (11), each of which is the orientational entropy in k_B units, are calculated as

Here it is important to note that this term represents a reference state such that the expected value of the orientational energy for each type of contacting residue pair in the native structures is equal to zero. Thus, this orientational potential represents simply the suitability of a relative orientation between contacting residues, but does not represent at all whether a contact between residues is favorable or not. The latter is supposed to be represented in the present scheme by the usual contact energy e^c_{aa′}. The reference distribution of residue-residue orientations for these orientational potentials is the uniform distribution, and not the overall distribution for all types of amino acid pairs employed by others.³³,³⁴,³⁵ Therefore, for residue pairs whose distributions coincide with the overall distribution, the latter potentials always give no preference but the present potentials give a preference. This is a desirable behavior for orientational potentials, because such an overall distribution of residue-residue orientations would not be an intrinsic characteristic of non-native conformations but rather of native structures of proteins.
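The role of the entropy term as a reference state can be illustrated with a small discrete analog. The sketch below is not Eq. (11) itself: it assumes a single direction (no averaging over a versus a′ and a′ versus a) and replaces the continuous density by a hypothetical list of cell probabilities, so that the energy of each cell is –ln p minus the entropy ⟨–ln p⟩.

```python
import math

def mean_force_energies(probs):
    """Energies -ln p_k shifted by the entropy, so that their expectation is zero.

    probs: strictly positive probabilities of discrete orientation cells
    (a hypothetical discretization used only for illustration).
    """
    entropy = -sum(p * math.log(p) for p in probs)
    return [-math.log(p) - entropy for p in probs]

def expected_energy(probs):
    """Expectation of the energy over the same (native) distribution."""
    return sum(p * e for p, e in zip(probs, mean_force_energies(probs)))
```

For any such distribution the expected energy vanishes, and for a uniform distribution every cell has zero energy, which is the sense in which the potential measures only the suitability of an orientation.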

Instead of directly evaluating the frequency distributions of relative residue-residue orientations in angular space, we estimate them with a series expansion in spherical harmonics functions. The use of spherical harmonics functions to represent orientational distributions of residue-residue pairs was attempted by Onizuka et al.³³ and Buchete et al.³⁴,³⁵ The probability density is expanded as follows in the series of spherical harmonics functions, which make a complete orthonormal system with the (θ, φ, Θ, Φ, Ψ) variables.

g_{l_pm_pl_em_ek_e} is represented as

where Y_{lm} is the normalized spherical harmonics function and P^{|m_p|}_{l_p} is the associated Legendre function; P with m_p = 0 is the Legendre polynomial. Then, the coefficients in the expansion of Eq. (18) can be calculated from the observed density distribution by

Thus, the coefficient of the first constant term in Eq. (18), which corresponds to the uniform distribution, is obvious;

Buchete et al.³⁴,³⁵ employed spherical harmonics functions only for smoothing the frequency distributions of residue-residue relative orientations observed in angular coordinate space. However, to estimate the expansion coefficients, the formal representation of an observed probability function with the δ function can be used,³³ that is,

and then, the expansion coefficients are calculated as

where (θ_μ, φ_μ, Θ_μ, Φ_μ, Ψ_μ) is a set of angles observed for the contact μ between residue types a and a′, and w_μ is a weight for this contact. The summations in the equations above are over all contacts of amino acid types a versus a′. A contact between amino acid types a and a′ is counted as one half of a contact for a versus a′ and another half for a′ versus a; N_{aa′} + N_{a′a} is equal to the actual number of contacts between amino acid types a and a′. Thus, a weight w_μ is equal to 0.5w^c, where w^c is a sampling weight for each protein that is described in the section "Datasets of protein structures used." In Eq. (24), residues are regarded to be in contact if the geometric centers of side chains, or C^α atoms for GLY, are within 6.5 Å.
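The estimation of expansion coefficients from a sum of δ functions can be sketched with a one-dimensional analog. The example below is an illustration under stated assumptions rather than the five-angle expansion used here: it takes a density in x = cos θ only, uses orthonormal Legendre polynomials in place of the full basis functions, and the function names are hypothetical. Each coefficient is simply the weighted average of the basis function over the observed samples.

```python
import math

def legendre(l, x):
    """Legendre polynomial P_l(x) by the standard three-term recurrence."""
    p_prev, p = 1.0, x
    if l == 0:
        return p_prev
    for n in range(1, l):
        p_prev, p = p, ((2 * n + 1) * x * p - n * p_prev) / (n + 1)
    return p

def basis(l, x):
    """Orthonormal basis function on [-1, 1]."""
    return math.sqrt((2 * l + 1) / 2.0) * legendre(l, x)

def expansion_coefficients(samples, weights, l_max):
    """Coefficients of a weighted sum of delta functions located at the samples."""
    total = sum(weights)
    return [sum(w * basis(l, x) for x, w in zip(samples, weights)) / total
            for l in range(l_max + 1)]

def density(coeffs, x):
    """Truncated series estimate of the probability density at x."""
    return sum(c * basis(l, x) for l, c in enumerate(coeffs))
```

In this analog the coefficient of the constant term is fixed by normalization regardless of the data, which mirrors the remark above that the coefficient corresponding to the uniform distribution is obvious; the weights play the role of w_μ = 0.5w^c.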

The sample size limits the frequencies of those modes whose expansion coefficients can be reliably estimated. High order terms are less reliably estimated than the low order terms. Bayesian statistical analysis suggests using "pseudocounts" for expected occurrences of residue pairs.⁸,⁴⁹ As a result, the expansion coefficients of the observed distribution are estimated as follows:

where β^{aa′}_{l_pm_pl_em_ek_e} is taken to be

in order to reduce statistical errors resulting from the small size of samples; β in Eq. (31) is a parameter to be optimized. Equation (31) means that more samples are required to determine higher frequency modes. In Eq. (27), the first term becomes more effective than the second term in the limit of small numbers of N_{aa′}, and inversely the second term becomes more effective than the first term in the limit of large numbers of N_{aa′}.
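The limiting behavior described for Eq. (27) is that of a pseudocount-weighted average. Since Eqs. (27) and (31) are not reproduced in this text, the sketch below assumes a generic form: a prior coefficient weighted by a pseudocount and an observed coefficient weighted by the number of samples, with a pseudocount that grows in proportion to a mode rank. The choice of prior, the proportionality, and all names are assumptions made only for illustration.

```python
def shrink_coefficient(c_observed, c_prior, n_samples, pseudocount):
    """Pseudocount-weighted average of a prior and an observed coefficient.

    The prior term dominates for small n_samples and the observed term
    dominates for large n_samples (assumed form, not Eq. (27) itself).
    """
    return (pseudocount * c_prior + n_samples * c_observed) / (pseudocount + n_samples)

def mode_pseudocount(beta, mode_rank):
    """Assumed pseudocount that increases with the rank (frequency) of a mode."""
    return beta * mode_rank
```

Under this assumption, a higher frequency mode carries a larger pseudocount and therefore needs more observed contacts before its estimated coefficient departs from the prior value.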

Then, higher order terms in Eq. (18), which tend to reflect artificial contributions from the small size of samples, are ignored by evaluating only the lower order terms with

and

where O_cutoff is a cutoff value for expansion terms.

In order to reduce the number of expansion terms, we choose only terms in the expansion whose coefficients have absolute values larger than a certain cutoff value. Thus, the probability density function is evaluated as

where H is the Heaviside step function, which takes a value of one for zero and positive values of the argument and is otherwise zero. Finally, the estimate of the probability density f_{aa′}(θ, φ, Θ, Φ, Ψ) is cut off at sufficiently low and high values in such a way that its logarithm takes a value within an appropriate range; for example, –7 ≤ –ln f_{aa′}(θ, φ, Θ, Φ, Ψ) + ln(c^{aa′}_{00000} g_{00000}) ≤ 1.
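The selection of significant terms and the final cutoff of the density can be sketched as follows. This is a minimal illustration, assuming that coefficients are stored in a dictionary keyed by mode indices and that the threshold is the product of c_cutoff and the coefficient of the constant term, as stated in the Results section; the function names are hypothetical.

```python
def heaviside(x):
    """Heaviside step: one for zero and positive arguments, zero otherwise."""
    return 1.0 if x >= 0.0 else 0.0

def significant_terms(coeffs, c_uniform, c_cutoff=0.025):
    """Keep modes whose absolute coefficient reaches c_cutoff * |c_uniform|."""
    threshold = c_cutoff * abs(c_uniform)
    return {mode: c for mode, c in coeffs.items()
            if heaviside(abs(c) - threshold) == 1.0}

def clip_relative_energy(value, low=-7.0, high=1.0):
    """Restrict -ln f + ln(c_00000 g_00000) to the example range quoted above."""
    return min(max(value, low), high)
```

A term with a very small coefficient is dropped from the sum, and an orientation whose estimated density is extremely high or low has its relative log density held at the nearest bound.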

The orientational entropy defined by Eq. (17) is evaluated with the observed probability distribution of Eq. (24).

D. Repulsive potentials

A repulsive potential used here is the one described in detail in Ref. 3 to prevent packing at overly high densities; it consists of a hard core repulsion e^hc, an excess contact energy e^{re}_i, and a repulsive packing potential e^{rp}_i,

where S_w is defined by Eq. (7). The repulsive packing potentials e^{rp}_i for the 20 types of residues are estimated from the observed distributions of the numbers of contacting residues in dense regions of protein structures by assuming a Boltzmann distribution.³ N(a_i, n^c_i) is the observed number of residues of type a_i that are surrounded by n^c_i residues in the database of protein structures. q^c_{a_i} is a coordination number, which is defined as the maximum feasible number of contacting residues around a residue, for the amino acid of type a_i. ε in Eq. (40) is a small value (ε = 10^–6) that is added to avoid the divergence of the logarithm function. The observed distribution N(a_i, n^c_i) used here is one⁴⁴ compiled from 2129 protein species representatives of the SCOP⁴⁸ Release 1.53 with our sampling method.³

E. Short-range potentials

The short-range potential is evaluated here by the sum of dihedral angle dependent energies e^s_{a_i}(φ_i, ψ_i) over all residues:

For this secondary structure potential, a 10° mesh over (φ, ψ) space is used to count frequencies of amino acids observed in protein native structures, and this intraresidue potential e^s_a for each amino acid type a is evaluated as

where N_a(φ, ψ) is the number of amino acids of type a at (φ, ψ) observed in protein native structures, and N_a is their sum over the entire (φ, ψ) space, that is, the number of amino acids of type a. The second term is a constant term that corresponds to a reference energy, so that the (φ, ψ) energy expected for each type of residue in the native structures is equal to zero.
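The counting on a 10° mesh and the zero-mean reference can be sketched as follows. The equation for e^s_a is not reproduced in this text, so the sketch assumes the energy of a cell is –ln of its observed fraction, shifted by the count-weighted mean; the small constant added inside the logarithm and the function names are assumptions for illustration, and angles are taken in degrees in the range from –180 to 180.

```python
import math
from collections import Counter

BIN_DEG = 10
N_BINS = 360 // BIN_DEG

def cell(phi, psi):
    """Index of the 10-degree mesh cell containing (phi, psi), angles in degrees."""
    i = int(math.floor((phi + 180.0) / BIN_DEG)) % N_BINS
    j = int(math.floor((psi + 180.0) / BIN_DEG)) % N_BINS
    return i, j

def phi_psi_energies(angles, epsilon=1e-6):
    """Cell energies for one amino acid type, with zero expectation over the data."""
    counts = Counter(cell(phi, psi) for phi, psi in angles)
    total = sum(counts.values())
    raw = {c: -math.log(n / total + epsilon) for c, n in counts.items()}
    # Reference energy: the mean of the raw energies over the observed residues.
    reference = sum(counts[c] / total * raw[c] for c in counts)
    return {c: e - reference for c, e in raw.items()}
```

With this shift, frequently observed (φ, ψ) cells receive negative energies and rarely observed cells receive positive ones, while the average over the native observations is zero.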

The observed distribution N_a(φ, ψ) used here is one⁴⁴ compiled from 2129 protein species representatives of the SCOP⁴⁸ Release 1.53 with the sampling method³ used to reduce the weights of contributions of structures having high sequence identity.

F. Datasets of protein structures used

To estimate the orientational potential, proteins each of which represents a different protein fold were collected. Release 1.61 of the SCOP database⁴⁸ was used for the classification of protein folds. Representatives of species are the first entries in the protein lists for each species in SCOP; if these first proteins in the lists are not appropriate (see below) to use for the present purpose, then the second ones are chosen. These species are all those belonging to the protein classes 1–5; that is, classes of all α, all β, α/β, α+β, and multidomain proteins. Classes of membrane and cell surface proteins, small proteins, peptides, and designed proteins are not used. Proteins whose structures⁵⁰ were determined by NMR or having stated resolutions worse than 2.5 Å are removed to assure that the quality of proteins used is high. Also, proteins whose coordinate sets consist either of only C^α atoms, or include many unknown residues, or lack many atoms or residues, are removed. In addition, proteins shorter than 50 residues are also removed. As a result, the set of species representatives includes 4435 protein domains; this dataset is named here as dataset A.

The recognition power of the orientational potentials for the protein native structures is evaluated by using decoy sets, "Decoys'R'Us."³⁹ To avoid a bias, orientational potentials to be tested are compiled from a dataset of protein structures from which native proteins included in the decoy sets are removed; the total number of proteins is reduced to 4369; this dataset is named dataset B.

Also, to remove sampling biases that result from sequence similarities among these representative proteins, a sampling weight for each protein is determined by the sampling method based on a sequence identity matrix between sequences, which is described in Ref. 3. In other words, each of the structures having similar sequences is sampled with a weight less than 1. As a result, the 4435 protein sequences of the dataset A correspond to the effective number, 3522, of sequences and include the effective number, 1 467 302, of residue-residue contacts. The 4369 sequences in the other protein dataset B correspond to the effective number, 3506, of sequences and include the effective number, 1 463 806, of contacts. The orientational distributions of contacting residues are evaluated in the multimeric state of the complete protein structure for each protein domain.

III. RESULTS

A. Local coordinate system affixed to each residue

In order to describe the relative directional and rotational positions of contacting residues, a local coordinate system defined as in Fig. 1 is affixed to each residue. Here the local coordinate system is defined for fold recognition, based only on the main chain atoms N, C^α, and C, to represent the orientational relationship between the main chains of contacting residues rather than representing³³,³⁴,³⁵ those relationships between the side chains. The origin O of the local coordinate system is located at the C^α position of each residue. The Y and Z axes are the ones formed by the vector product and the sum, respectively, of the unit vectors from N to C^α and from C^α to C. That is, the Y and Z axes are taken to be perpendicular to and in the plane of the three atoms N, C^α, and C, respectively. These form a right-handed coordinate system. There are two degrees of directional freedom and three degrees of rotational freedom in the relative orientation of one residue to another in contacting residue pairs. The relative direction and rotation of one residue to another in contacting residues are represented by polar angles (θ, φ) and Euler angles (Θ, Φ, Ψ), respectively.
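The construction of the local frame can be sketched directly from the description above. The sketch assumes that the X axis is the vector product of Y and Z, which makes the system right-handed, and that the polar angle θ is measured from the Z axis and φ from the X axis in the XY plane; these angle conventions belong to Fig. 1 and are not stated in the text, so they are assumptions here, as are the helper names.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(a):
    norm = math.sqrt(dot(a, a))  # fails for collinear atoms (zero vector)
    return tuple(x / norm for x in a)

def local_frame(n_atom, ca_atom, c_atom):
    """Right-handed (X, Y, Z) axes built from main chain atoms only."""
    u1 = unit(sub(ca_atom, n_atom))   # unit vector from N to C-alpha
    u2 = unit(sub(c_atom, ca_atom))   # unit vector from C-alpha to C
    y_axis = unit(cross(u1, u2))      # perpendicular to the N, C-alpha, C plane
    z_axis = unit(add(u1, u2))        # in the N, C-alpha, C plane
    x_axis = cross(y_axis, z_axis)    # completes the right-handed system
    return x_axis, y_axis, z_axis

def polar_angles(frame, origin, point):
    """Polar angles of a point seen from the frame (assumed convention)."""
    d = sub(point, origin)
    x, y, z = (dot(d, axis) for axis in frame)
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.acos(max(-1.0, min(1.0, z / r)))
    phi = math.atan2(y, x)
    return theta, phi
```

In use, the frame of one residue is built from its own N, C^α, and C coordinates, and the direction of a contacting residue is obtained by passing that residue's reference position, with the first residue's C^α as origin.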

Figure 1.

B. Orientational distributions of contacting residues

Release 1.61 of the SCOP database⁴⁸ for classification of protein folds has been used to choose representatives for different protein folds. In the 4435 chosen representative proteins, which correspond to an effective number of 3522 sequences, an effective number of 1 467 302 residue-residue contacts are observed and used here to evaluate the statistical distribution of relative residue-residue orientations for each type of residue pair. The orientational distributions are evaluated in the multimeric state of a whole protein structure for each protein domain.

As described in the Methods section, the sample size limits the frequencies of modes whose expansion coefficients can be reliably estimated. Here, values in the range 4–14 are used for l_p^max, l_e^max, and k_e^max, which are the maximum values of l_p, l_e, and k_e, that is, the highest frequency modes to be estimated. However, even though each of (l_p, m_p, l_e, m_e, k_e) is sufficiently small, their combinations may correspond to high frequency modes. The number of modes lower than or equal to (l_p, m_p, l_e, m_e, k_e), O_{l_pm_pl_em_ek_e} defined by Eq. (32), is used as a one-dimensional projection of (l_p, m_p, l_e, m_e, k_e) on a frequency axis. To remove high frequency modes, only frequency modes less than or equal to O_cutoff are utilized. In addition, only significant terms in the expansion of Eq. (35) whose coefficients take larger absolute values than the value of a cutoff, c_cutoff c^{aa′}_{00000}, are used to estimate the distributions of relative residue-residue orientations.

Deviations from the uniform distribution in the estimated orientational distributions can be measured by reductions in orientational entropy. In the case of the uniform distribution, the orientational entropy defined by Eq. (17) is equal to –ln(c^{aa′}_{00000} g_{00000}) = 6.900 in k_B units; k_B is the Boltzmann constant. The estimate of orientational entropy for each type of residue pair and the number of significant terms required for the estimation depend on the resolution of the potentials, that is, the values of l_p^max, l_e^max, and k_e^max, and also on the cutoff parameters O_cutoff and c_cutoff, and on β for the correction for a small sample size. Orientational entropies estimated with various values of the parameters are shown in Fig. 2, and the numbers of significant terms required are plotted in Fig. 3. Orientational entropies and the numbers of significant terms averaged with a weight of the number of contacts over all residue pairs are plotted against the cutoff value of the coefficient for expansion terms, c_cutoff. Triples of digits near curves in the figure indicate the values of (l_p^max, l_e^max, k_e^max). The entropy reduction is large when the resolution of the potential increases. The estimate of orientational entropy with l_p^max = l_e^max = k_e^max = 4, 5, 6 almost converges at the cutoff value c_cutoff = 0.025. The number of significant terms decreases almost exponentially with the cutoff value c_cutoff; see Fig. 3. The number of significant terms required for each type of residue pair is related to the orientational entropy for the residue pair. Figure 4 shows the correlation between the orientational entropies and the number of significant terms. As expected, many significant terms tend to be required for residue pairs whose orientational entropies are large. The frequency distribution of the number of significant terms for the 210 types of residue pairs is shown in Fig. 5, indicating that the orientational distribution strongly depends on the type of residue pair.
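The value 6.900 quoted for the uniform distribution can be checked with a short calculation. The sketch assumes that the density is normalized over the full solid angle 4π for the polar angles and over the rotation volume 8π² for the three Euler angles, so that the uniform density is 1/(32π³); these measures are the standard ones but are not written out in this text.

```python
import math

def uniform_orientational_entropy():
    """Entropy (in k_B units) of a uniform density over direction and rotation."""
    solid_angle = 4.0 * math.pi            # integral over the polar angles
    rotation_volume = 8.0 * math.pi ** 2   # integral over the three Euler angles
    return math.log(solid_angle * rotation_volume)

print(round(uniform_orientational_entropy(), 3))
```

Under these assumptions the entropy is ln(32π³), which agrees with the quoted 6.900 to three decimal places and is the maximum value any residue pair type can take.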

The orientational entropies ⟨–ln f_{aa′}⟩ for each type of residue pair are listed in Table I. Residue type "r" in Table I means any type of residue. As already noted in the Methods section, in principle this matrix is symmetrical. The table shows that the matrix is almost symmetrical, indicating the good quality of the statistical estimates. The values in this table are calculated with l_p^max = l_e^max = k_e^max = 6, O_cutoff = O_{33333} = 1792, β = 0.2, and c_cutoff = 0.025.

Orientational entropies for residue pairs with GLY appear to be relatively large. Also, orientational entropies for residue pairs with PRO tend to be larger than those for others but smaller than those for residue pairs with GLY. Residue pairs TRP-CYS/CYS-TRP have the smallest orientational entropies. Orientational entropies for residue pairs with CYS and GLU are relatively small. As expected, CYS-CYS, GLU-GLU, GLU-ASP/ASP-GLU, and LYS-LYS have relatively small orientational entropies, probably because of S–S bond interactions and charge-charge interactions.

C. Distributions of residue orientations depend significantly on Euler angles

It is interesting to see how much of the entropy reduction originates from polar angle dependences only, from Euler angle dependences only, and from cross correlations between them; the orientational entropy is defined by Eq. (17) and estimated by Eq. (36).

In Fig. 6, the broken line shows the maximum value of orientational entropy that each type of amino acid pair can take; it is equal to –ln(c_{00000}^{aa'} g_{00000}) = 6.900 for the uniform distribution. The abscissa indicates the amino acid pair identification number; amino acid types are numbered in the order of amino acids written along the abscissa. Thus, the amino acid pair identification number 1 means a CYS-CYS pair and 400 means a PRO-PRO pair. The lowest solid line is for a distribution estimated with l_p^max = l_e^max = k_e^max = 6. The highest solid line shows the orientational entropies estimated with l_p^max = 6, l_e^max = k_e^max = 0, and therefore the contribution to the total entropies from polar angle dependences. The middle line shows the orientational entropies estimated by subtracting the entropy, 6.900, for the uniform distribution from the sum of entropies estimated with l_p^max = 6, l_e^max = k_e^max = 0, and with l_p^max = 0, l_e^max = k_e^max = 6. In other words, the difference between the highest solid line and the middle line shows contributions to the total entropies from Euler angle dependences. The difference between the middle and lowest solid lines corresponds to contributions from the cross correlation between polar angle and Euler angle dependences. Cutoff values for significant terms in the expansion are O_cutoff = 1792 and c_cutoff = 0.025. The parameter for the correction for a small sample size is beta = 0.2.
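The bookkeeping behind the three solid lines of Fig. 6 can be written out as a small sketch. The function and argument names are hypothetical, and the numbers in the usage note are made up for illustration only; they are not values from Table I.

```python
S_UNIFORM = 6.900  # entropy of the uniform distribution, as quoted in the text


def entropy_contributions(s_polar, s_euler, s_full, s_uniform=S_UNIFORM):
    """Split the entropy reduction of one residue pair into three parts.

    s_polar: entropy estimated with polar-angle dependence only (highest line)
    s_euler: entropy estimated with Euler-angle dependence only
    s_full:  entropy estimated with both and their correlations (lowest line)
    """
    # Middle line: polar and Euler parts treated as independent.
    s_independent = s_polar + s_euler - s_uniform
    return {
        "polar": s_uniform - s_polar,       # broken line minus highest line
        "euler": s_polar - s_independent,   # highest line minus middle line
        "cross": s_independent - s_full,    # middle line minus lowest line
    }
```

For example, `entropy_contributions(6.7, 6.3, 5.5)` splits a total reduction of 1.4 into 0.2 (polar), 0.6 (Euler), and 0.6 (cross correlation); the three parts always sum to `s_uniform - s_full`.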

Figure 6.

These results clearly indicate that only small amounts of entropy reduction originate purely from polar angle dependences, and that the distribution of residue orientations has significantly large correlations between polar and Euler angles. Also, the fact that the lowest solid line is more jagged than the upper lines indicates that the distributions as a function of both polar and Euler angles can reflect more differences among the types of residue pairs than the others. Thus, the discrimination of native structures from non-native folds is expected to be improved by taking account of Euler angle dependencies in the distributions of residue-residue orientations.

D. Recognition power for native structures

We have evaluated the recognition power of the orientational potentials for native structures using independently constructed decoy sets, which are maintained at "http://dd.stanford.edu" as the database "Decoys'R'Us."³⁹ Here, the group of decoy sets named "multiple" is employed. This group consists of the following ten families of decoy sets, classified by the methods used to generate the decoys. Each decoy set provides multiple non-native structures as well as the native structure.

(1) The "4state_reduced" family, containing decoy sets for seven small proteins. C^alpha positions for these decoys were generated by exhaustively enumerating ten selectively chosen residues in each protein using a four-state off-lattice model.³⁶

(2) The "fisa" family, containing decoy sets for four alpha-helical proteins. The main chains for these decoys were generated using a fragment insertion simulated annealing procedure to assemble nativelike structures from fragments of unrelated protein structures with similar local sequences using Bayesian scoring functions.³⁷

(3) The "fisa_casp3" family, containing decoy sets for proteins predicted by the Baker group for CASP3. The same method as for the fisa set was used to generate the main chains and side chains for these decoys.

(4) The "hg_structal" family, containing decoy sets for 29 globins. Each decoy has been built by comparative modeling using 29 other globins as templates with the program "segmod."⁵¹

(5) The "lattice_ssfit" family, containing decoy sets for eight small proteins generated by ab initio methods.³⁸

(6) The local minima decoy set family ("lmds"), containing decoy sets derived from the experimental secondary structures of ten small proteins belonging to diverse structural classes. Each decoy is at a local minimum of an energy function.

(7) The second version, "lmds_v2," of the local minima decoy set family, lmds.

(8) The "semfold" family, containing decoy sets for six proteins.

(9) The "ig_structal" family, containing decoy sets for 61 immunoglobulin domains. Each decoy has been built by comparative modeling using all the other immunoglobulins as templates with the program segmod.⁵¹

(10) The "ig_structal_hires" family, which is a high resolution subset of ig_structal and contains decoy sets for 20 immunoglobulins. The resolution range for this set is 1.7–2.2 Å, compared to the range of 1.7–3.1 Å for the full set of 61.

In the following, these families of decoy sets are categorized into two classes. One consists of only the last two families above, i.e., the decoy set group of immunoglobulin domains, which are single chains of a multimer. The other contains the rest of the decoy families above and is called the decoy set group of monomeric proteins, although hg_structal contains decoy sets for some hemoglobins, which are tetrameric proteins, and the fragment B of protein A, which is in a complex with immunoglobulin F_c, is also contained as the decoy set 1FC2 in the decoy set families fisa, lmds, and lmds_v2. This classification, which depends on whether decoys are a single chain of a multimer, is based on the fact that the true ground state of those multimeric proteins requires all of the chains to be present; this is true especially for contact energies, although it is not expected for the orientational energies developed here or for short-range potentials such as the secondary structure potentials. The decoy set group of monomeric proteins consists of 79 decoy sets, and the decoy set group of immunoglobulin domains consists of 81 decoy sets.

In the evaluation of the recognition performance of potential functions for the native structures, proteins contained in the decoy sets have been removed from the dataset of proteins from which the orientational potentials are compiled; that is, dataset B is used.

E. Evaluation of the performance of potential functions in fold recognition

The performance of potential functions in fold recognition is evaluated for each decoy set by the rank, the logarithm of rank probability, and the Z score of the native fold in the energy scale, and by those of the lowest energy fold in the root mean square deviation (RMSD) scale. RMSD means the least root mean square deviation between C^alpha atoms in overlaps between the native structure and decoys. The rank probabilities, P_e in the energy scale and P_r in the RMSD scale, are defined as

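The Z scores, Z_e in the energy scale and Z_rmsd in the RMSD scale, are defined as

\[
Z_e = \frac{E_{\mathrm{native}} - \overline{E_{\mathrm{decoy}}}}{\sigma_E},
\qquad
Z_{\mathrm{rmsd}} = \frac{\mathrm{RMSD}_{\mathrm{lowest}} - \overline{\mathrm{RMSD}_{\mathrm{decoy}}}}{\sigma_{\mathrm{rmsd}}},
\]

where E_native is the energy of the native fold, overline(E_decoy) and sigma_E are the mean and the standard deviation of the energies of decoys, and overline(RMSD_decoy) and sigma_rmsd are the mean and the standard deviation of the RMSDs of decoys. RMSD_lowest is the RMSD of the lowest energy fold.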
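The measures described in this section can be sketched as follows for a single decoy set. This is a minimal illustration, not the authors' code: the function names are hypothetical, the rank probability is assumed to be the rank divided by the number of structures ranked, the population standard deviation is assumed for sigma, and the lowest energy fold is assumed to be searched among the decoys only.

```python
import math
from statistics import mean, pstdev


def energy_metrics(e_native, e_decoys):
    """Rank, ln P_e and Z_e of the native fold in the energy scale."""
    rank = 1 + sum(1 for e in e_decoys if e < e_native)
    p_e = rank / (len(e_decoys) + 1)  # assumed normalization (decoys + native)
    z_e = (e_native - mean(e_decoys)) / pstdev(e_decoys)
    return rank, math.log(p_e), z_e


def rmsd_metrics(e_decoys, rmsd_decoys):
    """Rank, ln P_r and Z_rmsd of the lowest energy decoy in the RMSD scale.

    e_decoys and rmsd_decoys are parallel lists over the decoys.
    """
    lowest = min(range(len(e_decoys)), key=e_decoys.__getitem__)
    rmsd_lowest = rmsd_decoys[lowest]
    rank = 1 + sum(1 for r in rmsd_decoys if r < rmsd_lowest)
    p_r = rank / len(rmsd_decoys)  # assumed normalization (decoys only)
    z_rmsd = (rmsd_lowest - mean(rmsd_decoys)) / pstdev(rmsd_decoys)
    return rank, math.log(p_r), z_rmsd
```

With these conventions, more negative values of ln P_e, Z_e, and Z_rmsd all indicate better discrimination, which matches the way the tables are read in the text.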

The correlation coefficient R of rank order between the energies and RMSDs of decoys is also listed in some tables, because it was used in Ref. 25.
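A rank-order correlation of this kind can be computed as sketched below. The sketch assumes that R is Spearman's coefficient, that is, the Pearson correlation of the ranks with ties given their average rank; the text does not state which estimator was used, and the function names are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks of values; tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_correlation(energies, rmsds):
    """Spearman rank correlation between decoy energies and decoy RMSDs."""
    r_e, r_r = average_ranks(energies), average_ranks(rmsds)
    m_e, m_r = mean(r_e), mean(r_r)
    cov = sum((a - m_e) * (b - m_r) for a, b in zip(r_e, r_r))
    var_e = sum((a - m_e) ** 2 for a in r_e)
    var_r = sum((b - m_r) ** 2 for b in r_r)
    return cov / (var_e * var_r) ** 0.5
```

A value near 1 means that low-energy decoys are also the ones closest to the native structure; the coefficient is undefined if all energies or all RMSDs are identical.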

F. How important are the Euler angle dependencies of relative residue orientations for fold recognition?

First, we examine how the discrimination power is improved by taking account of the Euler angle dependencies of relative orientations between residues. In the case of l_e^max = k_e^max = 0, Euler angle dependencies are completely ignored. Thus, comparisons of the discrimination performance between the cases of l_e^max = k_e^max = 0 and l_e^max, k_e^max ≠ 0 indicate how important the Euler angle dependencies of relative residue orientations are in fold recognition. In Tables II and III, the discrimination performance is compared among some combinations of the parameters l_p^max and l_e^max for both the decoy set groups of monomeric proteins and immunoglobulin domains; k_e^max was taken to be equal to l_e^max. The full lists of these tables are provided in the auxiliary material.⁵² Here, the potentials consist of the orientational potential e^o only. In these tables, the discrimination performance is evaluated by the number of decoy sets (no. of tops) in which the native structure is the lowest energy fold, by the averages over the decoy sets of the logarithms of the rank probabilities P_e in the energy scale and P_r in the RMSD scale, and by the mean Z scores Z_e of the native folds in the energy scale.

Table II(a) shows the dependencies of the recognition power on the resolution in polar angles; note that Euler angle dependencies are completely ignored with l_e^max = k_e^max = 0. Both the monomeric protein decoy set group and the immunoglobulin decoy set group show similar characteristics; as the resolution, that is, the value of l_p^max, increases up to 7, the number of top ranks tends to increase and the means of the log rank probabilities, overline(ln P_e) in the energy scale and overline(ln P_r) in the RMSD scale, tend to improve, taking more negative values. The potentials with 7 < l_p^max < 14 appear to yield worse results than that of l_p^max = 7. At l_p^max = 14, the orientational potential shows a performance similar to that for l_p^max = 7. These results indicate that the improvement in the performance of fold recognition is not monotonic with the number of expansion terms, and also that there may be an intrinsic periodicity in the polar-angle distribution of residue-residue orientations.

Similar performance is obtained for both decoy set groups by using the Euler angle distributions of residue-residue orientations. The dependencies of the recognition power on the resolution in Euler angles are shown in Table II(b). For this table, l_p^max = 0 is used, so that polar-angle dependencies are completely ignored. The best result in the cases of 4 ≤ l_e^max = k_e^max ≤ 7 is obtained in the case of the highest resolution, l_p^max = 0, l_e^max = k_e^max = 7. In comparison with the results for l_p^max = 7, l_e^max = k_e^max = 0, some improvement is clearly observed for the immunoglobulin decoy set group, although the Z score Z_e is slightly worse for the monomeric protein decoy set group. The native structures of immunoglobulin domains consist mainly of beta sheets. Hydrogen bonds between beta strands are essential to maintain beta sheets. In addition to hydrogen bonds, residue-residue packing between a beta sheet and other parts may require relatively stringent orientations between residues, especially for Euler angles.

To improve the performance, correlations between polar and Euler angle dependencies must be taken into account. Table III shows the improvements in recognition performance obtained by taking account of the correlations between polar and Euler angle dependencies. Table III(a) indicates that the recognition performance is improved by about 10% to 30% for both decoy set groups with increasing resolution, but reaches a limit around l_p^max = l_e^max = k_e^max ~ 6, O_cutoff ~ 1792, probably owing to the sample size. However, the comparison of the results for l_p^max = 7, l_e^max = k_e^max = 0; for l_p^max = l_e^max = k_e^max = 7, O_cutoff = O₇₇₀₀₀ = 64; and for l_p^max = l_e^max = k_e^max = 7, O_cutoff = O₀₀₇₇₇ = 960 indicates that including small numbers of lower orders of cross terms between polar and Euler angles does not lead to an improvement in performance, and that sufficient numbers of cross terms are required to improve the performance. This may be one of the reasons why Onizuka et al.³³ observed worse rather than better performance by taking account of Euler angle dependencies in orientational distributions.

Dependencies of the performance on the cutoff parameters are also examined. In cases of low resolution in which only polar dependencies are taken into account, the effects of the cutoff parameter c_cutoff on the recognition performance are not clear for the cases of c_cutoff = 0, 0.025, 0.05. However, in the cases of high resolution the value 0.05 for c_cutoff is not small enough to reproduce the orientational distributions for fold recognition. See tables in the auxiliary material⁵² for details. The threshold c_cutoff for significant expansion terms should be set as small as c_cutoff ~ 0.025. This is consistent with the fact that, as shown in Fig. 2, the mean orientational entropies can be reproduced by employing c_cutoff ~ 0.025. Using a value for c_cutoff lower than 0.025 does not always yield good performance and may even decrease the recognition power, probably because the expansion terms with small coefficients tend to correspond to statistical noise. Thus, the value of 0.025 is used here for c_cutoff.

The effects of beta for the small sample correction are shown in Table III(c). The potential shows a better performance around beta = 0.2, for which N_{aa'}/beta ~ 18 000 (= 1 467 302/400/0.2). This means that the first digit will be significant in the estimated values of the expansion coefficients for the terms of O_{l_pm_pl_em_ek_e} = 1792, because beta_{l_pm_pl_em_ek_e}^{aa'} in Eq. (31) becomes about 0.1 for O_{l_pm_pl_em_ek_e} = 1792. Thus, the values of beta = 0.2 and O_cutoff = 1792 would be consistent with one another.
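The arithmetic in this paragraph can be reproduced with a few lines. The sketch assumes that 1 467 302 is the total number of contacts, that 400 is the number of ordered residue-pair types, and that the correction for a term scales as O divided by N_{aa'}/beta; the last point is an inference from the quoted numbers, not a transcription of Eq. (31), and the variable names are hypothetical.

```python
N_CONTACTS = 1_467_302  # number quoted in the text, assumed to be total contacts
N_PAIR_TYPES = 400      # ordered residue-pair types (20 x 20)
BETA = 0.2
O_CUTOFF = 1792

mean_contacts_per_pair = N_CONTACTS / N_PAIR_TYPES  # about 3668
effective_samples = mean_contacts_per_pair / BETA   # about 18 341

# Assumption: the per-term correction is proportional to O / (N_aa' / beta).
correction_at_cutoff = O_CUTOFF / effective_samples  # about 0.098

print(round(effective_samples), round(correction_at_cutoff, 3))
```

Under these assumptions the correction at the cutoff is close to 0.1, in line with the statement that about one significant digit is retained for terms with O = 1792.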

The parameters l_p^max = l_e^max = k_e^max = 6 with O_cutoff = 1792, c_cutoff = 0.025, and beta = 0.2 are employed here, although O_cutoff = 960 is also good and could be chosen if one wants to reduce the number of expansion terms. The discrimination of the native structures is successful for 37 of the 79 monomeric decoy sets and for 59 of the 81 immunoglobulin decoy sets using the orientational energy.

The value of ln P_e for each decoy set is shown in Fig. 7: (a) for the decoy sets of monomeric proteins, and (b) for the immunoglobulin decoy sets. The abscissa shows the identification number of the decoy set that is listed for each decoy in tables in the auxiliary material.⁵² Cross marks and solid lines indicate the values for the case of l_p^max = 7, l_e^max = k_e^max = 0, which is the best case for each decoy set group if only polar-angle dependencies are taken into account. Open circles and broken lines are for the case of l_p^max = l_e^max = k_e^max = 6. For most decoy sets, the performance in the discrimination of the native structures is improved.

Figure 7.

G. How important are relative orientations between residues in fold recognition?

A summary of the effects of each potential component in Eq. (1) on the performance in fold recognition is listed in Table IV. The energy terms included in the total energy potential are listed in the first column of the table. The performances of those total energy potentials are evaluated by the number of top ranks (no. of tops), the means over all decoy sets of the logarithms of rank probabilities, ln P_e in the energy scale and ln P_r in the RMSD scale, and of the Z scores, Z_e in the energy scale and Z_rmsd in the RMSD scale, and the medians of those Z scores over all decoy sets. The mean values R-bar over all decoy sets of the correlation coefficients of rank order between the energies and RMSDs of the decoys are also listed for reference.

First, the results for the monomeric protein decoy set group clearly show that the orientational potential e^o can achieve a performance comparable to the simple contact potentials, without and with the collapse energy, Delta e^c and e_rr^c + Delta e^c, indicating that residues in the non-native structures are not well positioned with respect to the relative orientation between them.

It should be noted here that for the monomeric decoy set group the performance of the contact potential Delta e^c without the orientational energy is slightly better than that of the orientational energy e^o only, but it is significantly worse for the immunoglobulin decoy set group. Including the collapse energy e_rr^c makes the performance even worse, indicating that the contact potential without the orientational potential does not work at all for these decoy sets. In the case of multimeric proteins, the evaluation of contact energies for residues on the surface of a domain requires other domains and chains to be present. When other domains and chains are not available for a given domain, residue-residue contacts between domains and chains cannot be evaluated. Thus, as already mentioned, unlike the case of short-range potentials, the true ground state of those multimeric proteins in the contact potential requires all of the chains to be present. Especially in the case of immunoglobulin molecules, the interface among constant and variable domains occupies a large portion of the surface of the domains. Thus, the potential consisting of the simple contact energy shows an extremely poor performance for the immunoglobulin decoy sets. On the other hand, the orientational potential only measures how good or bad the relative orientations between contacting residues are, and thus its evaluation does not necessarily require the presence of all domains and chains in multimeric proteins, although it would be more precisely measured if all contacting residues were known; as seen from Eq. (11), the expected value of the orientational energy for contacting residues in native protein structures is adjusted to be equal to zero.

It is noteworthy that in Table IV(a) a large improvement in performance is not seen for the monomeric protein decoy set group, in which decoys have relatively compact structures, when the residue-type independent contact energy e_rr^c is added to the residue-type dependent contact potential Delta e^c, except for the case of the energy Delta e^c + e^o. This fact indicates that optimizing potentials is not simple.

It is interesting to note that the inclusion of the repulsive potential e^r partially improves the performance for the immunoglobulin decoy set group, in comparison with the case for the monomeric decoy set group. The repulsive potential favors packing densities similar to the residue densities observed in native structures. Thus, the fact that the repulsive potential works well for these decoy sets may indicate that these decoys do not mimic the native structures well with respect to residue density. However, for well designed decoys, the packing potential may work less favorably for the native fold, as shown in the case of the monomeric decoy set group.

The performance of the potential function is further improved for both of the present decoy set groups by including the simple short-range (phi, psi) potential, strongly indicating that the short-range interactions should not be ignored in fold recognition.

The improvement of the performance for fold recognition due to the orientational potential is also observed for almost all decoy sets. In Fig. 8, the value of the logarithm of the rank probability in the energy scale, ln P_e, for each decoy set is plotted against the identification number of the decoy set that is listed for each decoy in Table V and tables in the auxiliary material;⁵² (a) is for the monomeric protein decoy set group and (b) for the immunoglobulin decoy set group. Open circles and broken lines show the values for the potential function that includes the orientational energy e^o, and cross marks and solid lines are for the potential without the orientational energy. Even in the decoy sets of the monomeric proteins, ln P_e for each decoy set tends to be more negative in the potential that includes the orientational energy.

Figure 8.

H. Comparison of the performance of the present potential function with other potentials

The performance of the present potential function for each decoy family is listed in Table V, and that for each decoy set is provided as tables in the auxiliary material.⁵²

Table V and the tables in the auxiliary material⁵² also show the performances of some of the scoring functions²⁴,²⁵,³³,³⁴,³⁵ that have already been tested for some of these decoys. The scoring functions referred to here are four statistical potentials and one atomic semiempirical potential. The four statistical potentials are the atomic contact potential developed by Samudrala and Moult,¹³ the distance-dependent pair potential optimized for fold recognition by Tobi and Elber,²⁴ the optimal Chebyshev-expanded function minimizing Z scores devised by Fain, Xia, and Levitt,²⁵ and the distance-dependent angular potential named "3C326" developed by Onizuka et al.³³ The atomic semiempirical potential referred to here is a potential based on the CHARMM gas phase implicit hydrogen force field in conjunction with a generalized Born implicit solvation term by Dominy and Brooks,¹⁸ which specifically includes generalized Born, Coulomb, nonpolar solvation, and van der Waals energy terms. Data for the potential of Samudrala and Moult¹³ are taken from Fain, Xia, and Levitt.²⁵

The decoy sets of protein 1FC2 are found in the three decoy set families fisa, lmds, and lmds_v2, and in all of these decoy sets the present potential failed to identify the native folds. The coordinates of the native fold 1FC2 are for the fragment B of protein A in a complex with immunoglobulin F_c. All chains that interact with the fragment B may be required to estimate the ground state energy for this structure, especially because this fragment is only 43 residues long. The decoy sets of protein 1BBA are also found in two decoy set families, lmds and lmds_v2. This protein is a pancreatic hormone that consists of only 36 residues, and is expected to interact with relatively large receptor proteins. Protein 1NKL in lattice_ssfit and semfold can bind lipid, and protein 1BGA8-A in fisa_casp3 is found in the trimeric state in the PDB coordinate file. Thus, one reason why the present potential fails for some decoy sets may be that some chains are missing for the proper estimation of the ground state for these decoy sets. Otherwise, there could be interactions that are not taken into account in the present potential function.

However, overall the present potential function performs well in comparison with other scoring functions. The discrimination of the native structure is successful for 61 of 79 monomeric decoy sets and for 68 of 81 immunoglobulin decoy sets. Also, the mean Z score Z_e in the energy scale, which is equal to –4.45 for monomeric decoy sets and –3.29 for immunoglobulin decoy sets, is statistically significant. For the decoy sets in the globin family hg_structal, interactions between a heme and surrounding residues are not taken into account. Although the present potential fails to identify the native fold for 7 of 29 decoy sets in this family, the RMSD of the lowest energy fold is below 1 Å in 4 of these 7 decoy sets.

Table V clearly shows that the present method outperforms the other potentials for all the decoy families except for the fisa and fisa_casp3 decoy families, for which the potential developed by Tobi and Elber is better in the mean value of the energy Z score, although the present potential performs better than their potential in the cases of the 4state_reduced, lattice_ssfit, and lmds decoy families. One interesting fact is that the atomic semiempirical potential based on the CHARMM potential with generalized Born, Coulomb, nonpolar solvation, and van der Waals energy terms cannot perform better than the present coarse-grained potential, at least for the two reported decoy families 4state_reduced and hg_structal. At the current development stage of atomic potentials, identifying native structures appears to be a hard task, and atomic potentials that do not explicitly take account of solvent molecules cannot necessarily perform better than coarse-grained and residue-level statistical potentials. On the other hand, explicitly taking account of water molecules would take too much CPU time to estimate conformational free energies. This fact motivates our studies to develop coarse-grained potentials.

The correlation coefficient R of rank order between the energies and RMSDs of decoys is listed in Table V and tables in the auxiliary material,⁵² because it was also used in Ref. 25. There are many decoy sets for which the potential succeeds in identifying the native fold and for which both Z scores, Z_e and Z_rmsd, are large but the correlation coefficient R of rank order has values smaller than 0.3; see those values for the decoy set families lattice_ssfit, lmds, lmds_v2, and semfold. Thus, generally speaking, this measure R may be inappropriate for the evaluation of the performance of scoring functions. It may be appropriate only for some decoy sets that consist of near-native decoys only, such as the decoy sets in 4state_reduced.

IV. DISCUSSION

The present analyses of relative residue-residue orientations clearly indicate that the distribution of residue-residue orientations strongly depends on the Euler angles that specify three degrees of rotational freedom for one residue relative to another, and that it is possible to improve the performance of an energy potential in fold recognition by taking account of the Euler angle dependencies in residue-residue orientations.

In the analyses of relative residue-residue orientations by Buchete et al.,³⁴,³⁵ the Euler angle dependencies of residue-residue orientations were not completely taken into account, probably because the number of residue-residue pairs observed in known protein structures is too small to reliably estimate the orientational distribution with the required resolution by dividing space into many cells and counting samples observed in each cell. In order to overcome such problems, we chose a method proposed by Onizuka et al.³³ in which the observed distribution of residue-residue orientations is represented as a sum of delta functions, each of which represents the observed location in angular space. Then, the distribution of residue-residue orientations is estimated in the expansion with spherical harmonics functions, and the coefficients of the expansion terms are estimated by inversely transforming the observed distribution represented as the sum of delta functions.
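The idea of expanding a sum of delta functions can be illustrated with a one-dimensional analog. The sketch below is not the five-angle spherical harmonics expansion of the paper; it swaps in a plain Fourier series on a circle, where projecting the delta-function density onto each basis function reduces to a sample average. All function names are hypothetical.

```python
import cmath
import math


def fourier_coefficients(angles, n_max):
    """Coefficients c_n of p(x) = sum_n c_n exp(i n x) for samples on a circle.

    Writing the observed density as (1/N) * sum_j delta(x - x_j) and
    projecting onto exp(i n x) gives c_n = (1/(2 pi N)) * sum_j exp(-i n x_j).
    """
    n_samples = len(angles)
    return {
        n: sum(cmath.exp(-1j * n * x) for x in angles) / (2 * math.pi * n_samples)
        for n in range(-n_max, n_max + 1)
    }


def density(coeffs, x):
    """Evaluate the truncated expansion at angle x."""
    return sum(c * cmath.exp(1j * n * x) for n, c in coeffs.items()).real
```

Truncating at `n_max` plays the role of the resolution limit: evenly spread samples give vanishing coefficients for n ≠ 0 and hence a flat density, while clustered samples need higher modes, which are also the modes most affected by sampling noise.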

High frequency modes in the expansion must be ignored because they reflect artificial contributions originating in the small size of samples. Each term in the expansion has a different resolution, with various combinations of frequencies for each coordinate axis. A trivial example is that the first term g₀₀₀₀₀, corresponding to a uniform distribution, has the lowest resolution. Here, the resolution of each term is represented by O_{l_pm_pl_em_ek_e}, which is defined by Eq. (32) as the number of frequency modes lower than or equal to (l_p, m_p, l_e, m_e, k_e), and only terms whose O_{l_pm_pl_em_ek_e} is less than a cutoff value O_cutoff are used. The merit of this method is that the distribution can be constructed by using only expansion terms whose resolutions are low enough to be estimated from a limited number of samples of known protein structures. On the other hand, the cell partitioning method has a fixed resolution for each coordinate axis, so that high frequency modes with large values of O_{l_pm_pl_em_ek_e} can be included in the estimation of orientational distributions.
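The cutoff values quoted in this paper (O₃₃₃₃₃ = 1792, O₀₀₇₇₇ = 960, O₁₁₅₅₅ = 1584, O₂₂₄₄₄ = 2025, and O₇₇₀₀₀ = 64) all follow a simple pattern when m_p = l_p and m_e = l_e. The closed form below is inferred from those quoted values only; it is not a transcription of Eq. (32), and it says nothing about terms with m_p < l_p or m_e < l_e.

```python
def resolution_order(l_p: int, l_e: int, k_e: int) -> int:
    """Inferred mode count for the term (l_p, m_p=l_p, l_e, m_e=l_e, k_e)."""
    return (l_p + 1) ** 2 * (l_e + 1) ** 2 * (2 * k_e + 1)


# (l_p, l_e, k_e) -> value of O quoted in the text and figure captions
QUOTED = {
    (3, 3, 3): 1792,
    (0, 7, 7): 960,
    (1, 5, 5): 1584,
    (2, 4, 4): 2025,
    (7, 0, 0): 64,
}

for (l_p, l_e, k_e), value in QUOTED.items():
    assert resolution_order(l_p, l_e, k_e) == value
```

The uniform term (0, 0, 0) has order 1 under this form, consistent with its having the lowest resolution.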

Because the resolution of each term is different from that of the others, each term is corrected differently for the small size of samples according to its resolution; see Eqs. (27)–(32). In this correction scheme, the number of residue-residue pairs required for the estimation of an expansion coefficient c_{l_pm_pl_em_ek_e} increases proportionally with its resolution O_{l_pm_pl_em_ek_e}. The proportionality constant was determined on the basis of the performance of the potentials in fold recognition. Also, the maximum resolution that can be estimated depends on the sample size. The maximum values for l_p, m_p, l_e, m_e, and k_e, and for O_{l_pm_pl_em_ek_e}, are determined on the basis of the performance of the potentials in fold recognition.

Also, the reference distribution of residue-residue orientations for the present orientational potentials is not the overall distribution for all types of amino acid pairs but the uniform distribution, differing from other works.³³,³⁴,³⁵ Whether the uniform distribution is effective as a reference distribution depends on the decoy sets. If the structures of decoys have an overall distribution similar to that of native structures, then it will not be effective. However, such an overall distribution of residue-residue orientations would not be intrinsically characteristic of non-native conformations but instead of native structures of proteins. If so, this overall distribution may be one of the important characteristics that distinguish protein-like structures from others. On the other hand, there is no reason to avoid employing the uniform distribution as a reference distribution. The use of the uniform distribution as a reference distribution is desirable to fully evaluate the orientational distribution of each type of contacting residue pair in decoy structures. Our scheme differs from previous works³³,³⁴,³⁵ and allows us to more properly evaluate the effectiveness of the orientational potential in fold recognition.

However, the present method of evaluating orientational energies between contacting residues requires the evaluation of a large number of expansion terms.⁵³ Although this feature is a trade-off that accompanies the simplification of representing residues by single points, it can be an obstacle to using this method in CPU intensive calculations in which energy evaluations of many conformations are required. To reduce CPU time in the evaluation of orientational energies, orientational energies could be precalculated at grid points in the polar and Euler angular space, although this approach requires a large memory and disk space as a trade-off against CPU time.
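The memory cost of such a precalculated table can be estimated with a rough sketch. The grid spacing, the storage of one 4-byte value per cell, and the angle ranges (theta and Theta over 180 degrees, the other three angles over 360 degrees) are all assumptions chosen for illustration; the paper does not specify a grid.

```python
def grid_table_size(step_deg=10, pair_types=400, bytes_per_value=4):
    """Return (cells per pair type, total bytes) for a 5-angle lookup table."""
    n_theta = 180 // step_deg  # polar angle theta
    n_phi = 360 // step_deg    # polar angle phi
    n_Theta = 180 // step_deg  # Euler angle Theta
    n_Phi = 360 // step_deg    # Euler angle Phi
    n_Psi = 360 // step_deg    # Euler angle Psi
    cells = n_theta * n_phi * n_Theta * n_Phi * n_Psi
    return cells, cells * pair_types * bytes_per_value


cells, total_bytes = grid_table_size()
print(cells, f"{total_bytes / 1e9:.1f} GB")  # prints 15116544 24.2 GB
```

Even at a coarse 10 degree spacing the table for all 400 ordered pair types is on the order of tens of gigabytes under these assumptions, and halving the spacing multiplies the size by 32, which illustrates the memory trade-off mentioned above.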

In the present work, the total energy in Eq. (1) is assumed to consist of a simple sum of energy terms, because each energy potential has been evaluated in a similar manner as the potential of mean force from statistical distributions of residues observed in protein structures, avoiding overcounting particular interactions. One might assume a different weight for each contribution to the total energy, and try to optimize a weight for each energy term by minimizing the Z score Z_e for the decoy sets.¹⁶ However, equal weights are employed here for each term, because a set of optimum weights could strongly depend on the training decoy sets. For example, if bad contacts are removed and torsion angles are optimized for decoy structures, then the packing potential and the secondary structure potential tend to be useless in discriminating the native structures from decoys, and optimum weights for those potentials determined by minimizing the Z score would take on relatively small values. The training decoys for optimizing a weight of each energy term in a total potential must be carefully generated without bias. In addition, generating unfolded decoys is also necessary to obtain an appropriate value with such an optimization method for the collapse energy, which is represented as e_rr^c and which is an extremely important energy for a protein to fold because it compensates for the large conformational entropy loss of compact conformations.

FIGURES

Full figure (5 kB)

Fig. 1. The definitions of a local coordinate system affixed to eachresidue. The origin O of the local coordinate system islocated at the C position of each residue. The Yand Z axes are ones formed by the vector productand the sum of the unit vectors from N toC and from C to C, respectively. The X axisis taken to form a right-handed coordinate system. The relativedirection and rotation of one residue to the other incontacting residues are represented by polar angles ( theta , phi ) and Eulerangles ( Theta , Phi , Psi ), respectively. First citation in article

Full figure (9 kB)

Fig. 2. Dependencies of orientational entropies on parameters in theestimation of the orientational potentials. The orientational entropies averaged overall types of residue pairs with the weight of thenumber of contacts N_aa for each type of residue pairare plotted against the cutoff values for the expansion coefficients.Triplets of digits near solid lines indicate the values of(l pmax ,l emax ,k); for the non-solid lines, l = l = k = 6 is used. The otherparameters are beta = 0.2 for all lines, and O_cutoff = O₃₃₃₃₃ = 1792 for solidlines. The dotted line shows the case of O_cutoff = O₀₀₇₇₇ = 960, thedotted broken line is for O_cutoff = O₁₁₅₅₅ = 1584, and the broken lineis for O_cutoff = O₂₂₄₄₄ = 2025. First citation in article

Fig. 3. Dependence of the number of significant expansion terms on the estimation parameters for the orientational potentials. The numbers of significant terms, averaged over all types of residue pairs with the number of contacts N_aa′ of each pair type as the weight, are plotted against the cutoff value for the expansion coefficients. Triplets of digits near the curves indicate the values of (l_p^max, l_e^max, k); for the non-solid lines, l_p^max = l_e^max = k = 6 is used. The other parameters are β = 0.2 for all lines and O_cutoff = O₃₃₃₃₃ = 1792 for the solid lines. The dotted line shows the case of O_cutoff = O₀₀₇₇₇ = 960, the dotted broken line is for O_cutoff = O₁₁₅₅₅ = 1584, and the broken line is for O_cutoff = O₂₂₄₄₄ = 2025.

Fig. 4. Correlation between the number of significant expansion terms and the orientational entropy. The values for the 210 different types of residue pairs, each averaged over the residue pairs (a, a′) and (a′, a), are plotted here. The orientational potentials are evaluated with l_p^max = l_e^max = k = 6, O_cutoff = 1792, β = 0.2, and c_cutoff = 0.025.

Fig. 5. Histogram of the numbers of significant expansion terms for the 210 types of residue pairs; the numbers of significant expansion terms are averaged over the residue pairs (a, a′) and (a′, a). The bin size is 200. These data are for l_p^max = l_e^max = k = 6, O_cutoff = 1792, β = 0.2, and c_cutoff = 0.025.

Fig. 6. Orientational entropies, –ln f_aa′, for three types of distributions are plotted against the identification number of the amino acid pair (a, a′). Amino acid types are numbered in the order of the amino acids written along the abscissa; see text for details. The broken line shows the entropy, 6.900, for a uniform distribution. The lowest solid line shows the distribution with both polar and Euler angle dependencies, l_p^max = l_e^max = k = 6. The highest solid line shows the distribution with l_p^max = 6 and l_e^max = k = 0, which depends on polar angles only. The middle solid line shows the distribution that depends on polar angles with l_p^max = 6 and on Euler angles with l_e^max = k = 6, but ignores any correlation between polar and Euler angles. The values of the other parameters are O_cutoff = 1792, β = 0.2, and c_cutoff = 0.025.
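The value 6.900 quoted for the uniform distribution can be checked numerically. The short sketch below assumes that the uniform density is normalized over the full orientational space, that is, over the solid angle 4π spanned by the polar angles and the rotation-group volume 8π² spanned by the Euler angles; this reading is an inference from the quoted number, not a statement taken from the caption.

```python
import math

# Volume of the direction space: integral of sin(theta) dtheta dphi.
polar_volume = 4.0 * math.pi

# Volume of the rotation space: integral of sin(Theta) dTheta dPhi dPsi.
euler_volume = 8.0 * math.pi ** 2

# A uniform density is 1 / (total volume), so its entropy,
# -ln(density), is the logarithm of the total volume.
uniform_entropy = math.log(polar_volume * euler_volume)
print(f"{uniform_entropy:.3f}")
```

Under this assumption the entropy equals ln(32π³), which agrees with the value in the caption to three decimal places; the entropies in Table I lie below it because the observed distributions are more sharply peaked than a uniform one.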

Fig. 7. The effects of Euler angle dependencies in the orientational potentials on the performance of fold recognition. The logarithm of the rank probability P_e in the energy scale for each decoy set is plotted against the identification number of the decoy set, as listed in Table V and in the tables of the auxiliary material (Ref. 52). Panel (a) on the left corresponds to the decoy set group of monomeric proteins in "Decoys'R'Us" (Ref. 39), and panel (b) on the right to the immunoglobulin decoy set group. The potential function used here consists of the orientational potential e^o only. Cross marks and solid lines show the case of the orientational potential with l_p^max = 7, l_e^max = k = 0, O_cutoff = ∞, and c_cutoff = 0.025. Open circles and broken lines show the case of the orientational potential with l_p^max = l_e^max = k = 6, O_cutoff = 1792, and c_cutoff = 0.025.
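The quantity plotted in Fig. 7 can be sketched as follows, assuming that the rank probability P_e is the energy rank of the native structure divided by the total number of structures in the decoy set (native included). This definition is an assumption made for illustration, since it is not restated in the caption, and the function name is hypothetical.

```python
import math

def log_rank_probability(native_energy, decoy_energies):
    """ln P_e under the assumed definition P_e = rank / total.

    The rank of the native structure is 1 when no decoy has a lower
    energy; the total counts the decoys plus the native structure.
    """
    rank = 1 + sum(1 for e in decoy_energies if e < native_energy)
    total = len(decoy_energies) + 1
    return math.log(rank / total)

# Hypothetical decoy set: 1999 decoys with energies 1.0, 2.0, ..., 1999.0.
decoys = [float(i) for i in range(1, 2000)]
top_ranked = log_rank_probability(0.0, decoys)   # native has the lowest energy
print(f"{top_ranked:.2f}")
```

With this definition a top-ranked native structure among about 2000 structures gives ln P_e close to –7.60, so the lower bound of ln P_e depends on the size of each decoy set.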

Fig. 8. The effects of the orientational potentials on the performance of fold recognition. The logarithm of the rank probability P_e in the energy scale for each decoy set is compared between two types of potential functions, one of which includes the orientational potential. The abscissa shows the identification number of each decoy set, as listed in Table V and in the tables of the auxiliary material (Ref. 52). (a) The potentials for the monomeric protein decoy sets consist of e_rr^c + Δe^c for cross marks and solid lines, and e_rr^c + Δe^c + e^o for open circles and broken lines. (b) The potentials for the immunoglobulin decoy sets consist of Δe^c + e^r for cross marks and solid lines, and e^o + e^r for open circles and broken lines. The orientational energies are evaluated with l_p^max = l_e^max = k = 6, O_cutoff = 1792, β = 0.2, and c_cutoff = 0.025.

TABLES

Table I. Orientational entropy, –ln f_aa′, in k_B units for each residue pair (a, a′); a (a′) is shown in each row (column), r stands for all types of residues, and the parameters used are l_p^max = l_e^max = k = 6, O_cutoff = 1792, β = 0.2, and c_cutoff = 0.025.
C M F I L V W Y A G T S Q N E D H R K P r
C 3.97 4.06 4.52 4.31 4.54 4.33 3.62 4.33 4.38 4.74 4.40 4.43 4.02 4.25 3.96 4.00 3.96 4.26 4.01 4.50 5.12
M 4.07 4.47 4.69 4.44 4.58 4.45 4.23 4.64 4.50 4.88 4.48 4.57 4.24 4.42 4.15 4.16 4.21 4.35 4.04 4.78 4.97
F 4.51 4.71 4.92 4.73 4.88 4.68 4.55 4.86 4.84 5.09 4.82 4.83 4.51 4.82 4.60 4.60 4.60 4.67 4.50 4.90 5.16
I 4.31 4.45 4.72 4.38 4.52 4.34 4.42 4.66 4.36 4.91 4.47 4.57 4.27 4.47 4.13 4.27 4.34 4.44 4.10 4.82 4.77
L 4.53 4.57 4.88 4.52 4.68 4.55 4.60 4.78 4.43 5.01 4.62 4.64 4.35 4.65 4.20 4.41 4.68 4.56 4.28 5.06 4.86
V 4.31 4.46 4.69 4.33 4.55 4.21 4.53 4.65 4.33 4.90 4.44 4.55 4.43 4.60 4.22 4.28 4.43 4.48 4.16 4.80 4.78
W 3.59 4.23 4.53 4.43 4.59 4.53 3.87 4.46 4.78 4.79 4.46 4.51 4.06 4.27 4.29 4.40 4.09 4.28 4.01 4.56 5.21
Y 4.34 4.61 4.85 4.63 4.74 4.62 4.44 4.87 4.85 5.11 4.78 4.80 4.46 4.86 4.76 4.91 4.71 4.66 4.38 4.88 5.23
A 4.34 4.50 4.85 4.33 4.42 4.29 4.76 4.85 3.76 4.88 4.46 4.45 4.37 4.52 4.10 4.05 4.60 4.53 4.20 4.96 4.78
G 4.70 4.88 5.12 4.89 4.98 4.88 4.84 5.13 4.88 5.47 5.12 5.31 5.00 5.30 4.90 4.95 5.06 5.22 4.97 5.35 5.61
T 4.37 4.46 4.82 4.44 4.62 4.44 4.44 4.80 4.46 5.13 4.23 4.54 4.19 4.63 3.95 4.16 4.52 4.62 4.16 4.91 4.95
S 4.42 4.56 4.87 4.56 4.62 4.54 4.50 4.82 4.41 5.30 4.54 4.67 4.42 4.78 4.24 4.33 4.59 4.76 4.48 4.98 5.09
Q 4.02 4.20 4.51 4.21 4.31 4.38 4.07 4.47 4.36 5.02 4.19 4.39 4.15 4.39 3.84 4.03 4.32 4.27 3.91 4.72 4.86
N 4.23 4.41 4.84 4.48 4.61 4.58 4.30 4.85 4.52 5.28 4.65 4.77 4.39 4.84 4.28 4.45 4.59 4.71 4.36 4.97 5.22
E 3.96 4.12 4.59 4.12 4.19 4.18 4.29 4.81 4.10 4.93 3.95 4.22 3.81 4.29 3.72 3.83 4.58 4.39 4.06 4.54 4.71
D 3.96 4.14 4.61 4.24 4.38 4.28 4.42 4.95 4.06 4.95 4.14 4.32 4.03 4.44 3.83 4.13 4.71 4.85 4.46 4.67 4.95
H 3.98 4.20 4.58 4.33 4.66 4.43 4.09 4.73 4.60 5.07 4.51 4.53 4.30 4.60 4.58 4.71 4.40 4.44 4.18 4.63 5.27
R 4.26 4.36 4.68 4.42 4.55 4.46 4.31 4.72 4.54 5.25 4.63 4.75 4.27 4.73 4.37 4.87 4.47 4.66 4.05 4.88 5.08
K 3.97 4.06 4.51 4.09 4.26 4.15 3.99 4.42 4.22 5.00 4.19 4.49 3.94 4.38 4.06 4.48 4.18 4.07 3.85 4.53 4.81
P 4.47 4.76 4.94 4.80 5.06 4.79 4.59 4.91 4.97 5.35 4.89 4.96 4.76 5.00 4.58 4.66 4.68 4.89 4.54 5.19 5.48
r 5.11 4.97 5.15 4.77 4.86 4.77 5.21 5.24 4.78 5.61 4.96 5.09 4.88 5.23 4.72 4.96 5.26 5.08 4.81 5.48 5.18

Table II. Dependencies of the performance of fold recognition on the resolution of the orientational potential; dependencies on polar or Euler angles. For each group of decoy sets, the columns give the number of top ranks (tops) and the means over the decoy sets of ln P_e, ln P_r, and Z_e.
(a) Dependencies on polar angles; l_e^max = k = 0, β = 0.2, O_cutoff = ∞
                   79 monomeric decoy sets         81 Ig decoy sets
l_p^max  c_cutoff  tops    ln P_e  ln P_r  Z_e     tops    ln P_e  ln P_r  Z_e
4        0.0       23      –2.79   –2.09   –1.41   29      –2.66   –1.88   –1.45
4        0.025     22      –2.77   –2.02   –1.41   28      –2.67   –1.82   –1.45
5        0.0       31      –3.35   –2.57   –1.84   31      –2.68   –1.96   –1.46
5        0.025     31      –3.37   –2.57   –1.84   30      –2.66   –1.93   –1.45
6        0.0       27      –3.23   –2.55   –1.77   34      –2.69   –2.19   –1.45
6        0.025     28      –3.24   –2.58   –1.76   34      –2.68   –2.16   –1.44
7        0.0       30      –3.45   –2.60   –1.98   45      –2.93   –2.52   –1.57
7        0.025     31      –3.46   –2.60   –1.98   45      –2.94   –2.53   –1.58
8        0.0       28      –3.37   –2.59   –1.91   38      –2.73   –2.24   –1.48
8        0.025     27      –3.36   –2.55   –1.89   39      –2.74   –2.27   –1.49
9        0.0       25      –3.38   –2.43   –1.92   32      –2.66   –2.06   –1.54
9        0.025     24      –3.36   –2.44   –1.90   33      –2.68   –2.08   –1.56
10       0.0       27      –3.32   –2.55   –1.83   37      –2.55   –2.13   –1.52
10       0.025     26      –3.31   –2.49   –1.82   36      –2.52   –2.14   –1.55
11       0.0       28      –3.44   –2.67   –1.94   39      –2.68   –2.16   –1.71
11       0.025     30      –3.48   –2.82   –1.92   39      –2.67   –2.18   –1.72
12       0.0       25      –3.29   –2.45   –1.78   41      –2.70   –2.29   –1.76
12       0.025     24      –3.30   –2.50   –1.77   40      –2.70   –2.29   –1.77
13       0.0       30      –3.39   –2.73   –1.80   39      –2.80   –2.19   –1.83
13       0.025     29      –3.38   –2.73   –1.80   40      –2.80   –2.20   –1.83
14       0.0       31      –3.42   –2.89   –1.84   46      –2.87   –2.48   –1.91
14       0.025     30      –3.44   –2.82   –1.82   47      –2.89   –2.53   –1.89
(b) Dependencies on Euler angles; l_p^max = 0, β = 0.2, O_cutoff = ∞
                       79 monomeric decoy sets         81 Ig decoy sets
l_e^max = k  c_cutoff  tops    ln P_e  ln P_r  Z_e     tops    ln P_e  ln P_r  Z_e
4            0.0       25      –3.18   –2.68   –1.78   33      –2.63   –2.26   –1.31
4            0.025     25      –3.14   –2.71   –1.75   33      –2.61   –2.31   –1.29
5            0.0       25      –3.26   –2.79   –1.77   44      –2.85   –2.55   –1.65
5            0.025     26      –3.23   –2.80   –1.74   44      –2.84   –2.58   –1.61
6            0.0       26      –3.25   –2.79   –1.83   47      –3.04   –2.78   –1.84
6            0.025     24      –3.20   –2.57   –1.81   45      –3.00   –2.79   –1.77
7            0.0       30      –3.31   –2.84   –1.88   52      –3.03   –2.94   –1.82
7            0.025     28      –3.24   –2.70   –1.83   52      –3.02   –2.92   –1.73

Table III. Dependencies of the performance of fold recognition on the resolution of the orientational potential; interdependencies between polar and Euler angles. For each group of decoy sets, the columns give the number of top ranks (tops) and the means over the decoy sets of ln P_e, ln P_r, and Z_e.
(a) Dependencies on l^max and the cutoff O_cutoff; l_e^max = k = l_p^max, β = 0.2, c_cutoff = 0.025
                   79 monomeric decoy sets         81 Ig decoy sets
l_p^max  O_cutoff  tops    ln P_e  ln P_r  Z_e     tops    ln P_e  ln P_r  Z_e
4        960       34      –3.72   –3.24   –2.18   47      –2.97   –2.81   –1.59
4        1792      36      –3.77   –3.27   –2.21   47      –3.01   –2.79   –1.67
5        960       36      –3.82   –3.38   –2.27   56      –3.18   –3.02   –1.81
5        1792      38      –3.87   –3.22   –2.33   55      –3.23   –2.92   –1.96
6        960       37      –3.83   –3.33   –2.32   60      –3.24   –3.23   –1.92
6        1792      37      –3.88   –3.22   –2.38   59      –3.27   –3.11   –2.00
6        2025      38      –3.85   –3.25   –2.36   56      –3.21   –3.05   –1.99
7        64        27      –3.53   –2.95   –1.93   30      –2.63   –2.04   –1.46
7        960       36      –3.85   –3.22   –2.34   57      –3.22   –3.11   –1.93
7        1792      38      –3.91   –3.31   –2.42   53      –3.20   –2.94   –2.02
7        2025      37      –3.87   –3.29   –2.40   54      –3.20   –3.02   –2.04
(b) Dependencies on the cutoff c_cutoff
l_e^max = k = l_p^max, β = 0.2, O_cutoff = 960
                   79 monomeric decoy sets         81 Ig decoy sets
l_p^max  c_cutoff  tops    ln P_e  ln P_r  Z_e     tops    ln P_e  ln P_r  Z_e
5        0.0       35      –3.81   –3.33   –2.27   55      –3.17   –2.96   –1.83
5        0.025     36      –3.82   –3.38   –2.27   56      –3.18   –3.02   –1.81
6        0.0       34      –3.80   –3.24   –2.32   60      –3.26   –3.25   –1.95
6        0.025     37      –3.83   –3.33   –2.32   60      –3.24   –3.23   –1.92
7        0.0       34      –3.82   –3.11   –2.33   59      –3.25   –3.17   –1.96
7        0.025     36      –3.85   –3.22   –2.34   57      –3.22   –3.11   –1.93
l_e^max = k = l_p^max, β = 0.2, O_cutoff = 1792
l_p^max  c_cutoff  tops    ln P_e  ln P_r  Z_e     tops    ln P_e  ln P_r  Z_e
5        0.0       38      –3.88   –3.30   –2.34   56      –3.23   –2.93   –1.96
5        0.025     38      –3.87   –3.22   –2.33   55      –3.23   –2.92   –1.96
6        0.0       37      –3.87   –3.35   –2.40   60      –3.28   –3.14   –2.01
6        0.025     37      –3.88   –3.22   –2.38   59      –3.27   –3.11   –2.00
7        0.0       39      –3.92   –3.27   –2.43   55      –3.20   –3.05   –2.05
7        0.025     38      –3.91   –3.31   –2.42   53      –3.20   –2.94   –2.02
(c) Dependencies on the parameter for small sample correction, β; l_p^max = l_e^max = k = 6, c_cutoff = 0.025
                   79 monomeric decoy sets         81 Ig decoy sets
O_cutoff  β        tops    ln P_e  ln P_r  Z_e     tops    ln P_e  ln P_r  Z_e
960       0.1      35      –3.82   –3.26   –2.32   60      –3.25   –3.23   –1.93
960       0.2      37      –3.83   –3.33   –2.32   60      –3.24   –3.23   –1.92
960       1        34      –3.78   –3.23   –2.28   58      –3.22   –3.19   –1.89
1792      0.1      36      –3.86   –3.15   –2.39   59      –3.27   –3.11   –2.00
1792      0.2      37      –3.88   –3.22   –2.38   59      –3.27   –3.11   –2.00
1792      1        36      –3.85   –3.18   –2.34   57      –3.24   –3.05   –1.97

Table IV. Performance of each potential component in fold recognition. The columns give the number of top ranks (tops), the means of ln P_e, ln P_r, Z_e, and Z_rmsd, the medians of Z_e and Z_rmsd, and the mean of R^b over the decoy sets.
(a) For the 79 monomeric decoy sets (total No. = 79)
                                         mean                            median          mean
Potentials^a                     tops    ln P_e  ln P_r  Z_e     Z_rmsd  Z_e     Z_rmsd  R^b
e^o                              37      –3.88   –3.22   –2.38   –2.49   –2.09   –1.65   0.33
e^o + e^r                        35      –3.79   –3.08   –2.32   –2.33   –2.01   –1.49   0.33
e^o + e^s                        53      –4.00   –3.99   –2.96   –3.13   –3.22   –2.59   0.35
e^o + e^r + e^s                  53      –3.98   –3.99   –2.93   –3.13   –3.16   –2.59   0.34
Δe^c                             36      –4.12   –3.20   –2.56   –2.12   –2.37   –1.63   0.33
Δe^c + e^r                       41      –3.90   –3.12   –2.23   –2.03   –2.04   –1.74   0.32
Δe^c + e^o                       52      –4.53   –4.24   –3.18   –3.19   –2.79   –2.60   0.37
Δe^c + e^o + e^r                 52      –4.38   –4.04   –2.95   –3.01   –2.54   –2.50   0.37
Δe^c + e^o + e^s                 58      –4.25   –4.30   –3.51   –3.38   –3.48   –3.04   0.37
Δe^c + e^o + e^r + e^s           57      –4.15   –4.24   –3.35   –3.35   –3.17   –2.80   0.37
e_rr^c + Δe^c                    36      –4.05   –3.29   –2.68   –2.32   –2.61   –1.86   0.32
e_rr^c + Δe^c + e^r              38      –4.18   –3.50   –2.53   –2.50   –2.49   –2.14   0.32
e_rr^c + Δe^c + e^o              58      –4.79   –4.88   –4.38   –3.92   –4.08   –3.55   0.40
e_rr^c + Δe^c + e^o + e^r        57      –4.73   –4.69   –4.13   –3.74   –3.76   –3.41   0.40
e_rr^c + Δe^c + e^o + e^s        61      –4.63   –4.63   –4.45   –3.68   –4.11   –3.41   0.39
e_rr^c + Δe^c + e^o + e^r + e^s  59      –4.49   –4.49   –4.21   –3.56   –3.86   –3.10   0.39
(b) For the 81 immunoglobulin decoy sets (total No. = 81)
                                         mean                            median          mean
Potentials^a                     tops    ln P_e  ln P_r  Z_e     Z_rmsd  Z_e     Z_rmsd  R^b
e^o                              59      –3.27   –3.11   –2.00   –2.74   –2.03   –2.55   0.38
e^o + e^r                        62      –3.35   –3.23   –2.15   –2.85   –2.27   –2.61   0.36
e^o + e^s                        67      –3.36   –3.42   –3.14   –3.00   –3.27   –2.69   0.39
e^o + e^r + e^s                  68      –3.38   –3.46   –3.29   –3.03   –3.44   –2.71   0.37
Δe^c                             6       –1.55   –1.38   –0.52   –0.65   –0.51   –0.47   0.38
Δe^c + e^r                       36      –2.78   –2.29   –1.02   –1.70   –0.95   –1.15   0.29
Δe^c + e^o                       57      –3.20   –3.09   –1.57   –2.70   –1.55   –2.53   0.44
Δe^c + e^o + e^r                 63      –3.39   –3.35   –1.82   –2.95   –1.79   –2.67   0.40
Δe^c + e^o + e^s                 68      –3.36   –3.50   –2.53   –3.09   –2.44   –2.69   0.43
Δe^c + e^o + e^r + e^s           69      –3.39   –3.52   –2.81   –3.09   –2.81   –2.71   0.40
e_rr^c + Δe^c                    0       –0.40   –1.33   0.54    –0.46   0.44    –0.49   0.35
e_rr^c + Δe^c + e^r              0       –0.44   –1.29   0.35    –0.50   0.24    –0.49   0.32
e_rr^c + Δe^c + e^o              19      –2.11   –2.08   –0.86   –1.26   –0.89   –0.79   0.50
e_rr^c + Δe^c + e^o + e^r        44      –2.82   –2.81   –1.20   –2.22   –1.25   –2.13   0.48
e_rr^c + Δe^c + e^o + e^s        55      –3.00   –3.10   –1.83   –2.63   –1.94   –2.53   0.49
e_rr^c + Δe^c + e^o + e^r + e^s  61      –3.24   –3.31   –2.25   –2.82   –2.34   –2.61   0.46
^a The orientational energies used above are calculated with l_p^max = l_e^max = k = 6, O_cutoff = 1792, β = 0.2, and c_cutoff = 0.025.
^b R is the correlation coefficient of rank order between the energies and RMSDs of decoys in a decoy set.
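Footnote b can be illustrated with a minimal sketch of a rank-order correlation between the energies and RMSDs of the decoys in one set. It is assumed here that the coefficient is of the Spearman type (the Pearson correlation of the two rank vectors) and that there are no tied values; both points are assumptions, and the data are hypothetical.

```python
import numpy as np

def rank_order(values):
    """Ranks 1..n of the values in ascending order (ties not handled)."""
    values = np.asarray(values, dtype=float)
    ranks = np.empty(len(values))
    ranks[np.argsort(values)] = np.arange(1, len(values) + 1)
    return ranks

def rank_correlation(energies, rmsds):
    """Pearson correlation of the ranks (Spearman-type coefficient)."""
    return float(np.corrcoef(rank_order(energies), rank_order(rmsds))[0, 1])

# Hypothetical decoy set: energies that rise monotonically with RMSD.
rmsds = [1.2, 2.5, 3.1, 4.8, 6.0]
energies = [-20.0, -18.5, -15.0, -14.2, -9.9]
r = rank_correlation(energies, rmsds)
print(f"R = {r:.2f}")
```

A value near 1 indicates that decoys closer to the native structure tend to receive lower energies, whereas a value near 0, as for several decoy families in Table V, indicates that the energy ordering of the decoys carries little information about their RMSD even when the native structure itself is ranked first.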

Table V. The performance of scoring functions for each family of protein decoy sets. Entries that are not reported are left blank.
Decoy ID range, decoy family
  Potentials                      No. of tops/total  mean ln P_e  mean Z_e  mean R^a
1–7 4state_reduced: seven decoy sets
  (e_rr^c + Δe^c + e^o + e^s)^b   7/7                –6.50        –4.44     0.66
  Fain et al. (2002)^c            1/7                –4.45        –2.3      0.52
  Toby and Elber (2000)^d         3/6                –5.42        –3.14
  Samudrala and Moult (1998)^e    6/7                –6.06        –2.67     0.67
  Onizuka et al. (2002)^f         7/7                –6.50        –3.41
  Dominy and Brooks (2002)^g      ~7/7               ~–6.5        –3.4      0.55
8–11 fisa: four decoy sets
  (e_rr^c + Δe^c + e^o + e^s)^b   2/4                –4.04        –2.55     0.26
  Toby and Elber (2000)^d         2/3                             –3.34
  Onizuka et al. (2002)^f         1/3                             –1.38
12–16 fisa_casp3: five decoy sets
  (e_rr^c + Δe^c + e^o + e^s)^b   2/5                –5.38        –3.61     0.16
  Toby and Elber (2000)^d         1/3                             –3.94
  Onizuka et al. (2002)^f         1/3                             –2.01
17–45 hg_structal: 29 decoy sets
  (e_rr^c + Δe^c + e^o + e^s)^b   22/29              –2.76        –2.62     0.72
  Dominy and Brooks (2002)^g      19/29                           –2.0      0.69
46–53 lattice_ssfit: eight decoy sets
  (e_rr^c + Δe^c + e^o + e^s)^b   8/8                –7.60        –11.12    –0.01
  Fain et al. (2002)^c            8/8                –7.60        –6.84
  Toby and Elber (2000)^d         4/6                –6.89        –4.10
  Samudrala and Moult (1998)^e    8/8                –7.60        –6.46
  Onizuka et al. (2002)^f         6/6                –7.60        –6.22
54–63 lmds: ten decoy sets
  (e_rr^c + Δe^c + e^o + e^s)^b   8/10               –4.89        –5.34     0.14
  Fain et al. (2002)^c            3/9                –4.55        –2.83
  Toby and Elber (2000)^d         4/7                –5.32        –3.27
  Samudrala and Moult (1998)^e    3/9                –3.04        –0.58
  Onizuka et al. (2002)^f         5/7                –5.00        –3.67
64–73 lmds_v2: ten decoy sets
  (e_rr^c + Δe^c + e^o + e^s)^b   8/10               –3.85        –5.03     0.18
  Fain et al. (2002)^c            1/2                –4.81        –3.15
  Samudrala and Moult (1998)^e    1/2                –4.47        –3.05
74–79 semfold: six decoy sets
  (e_rr^c + Δe^c + e^o + e^s)^b   4/6                –8.13        –3.86     0.08
1–61 ig_structal: 61 decoy sets
  (e^o + e^r + e^s)^b             49/61              –3.55        –2.96     0.36
62–81 ig_structal_hires: 20 decoy sets
  (e^o + e^r + e^s)^b             19/20              –2.86        –4.31     0.43
^a R is the correlation coefficient of rank order between the energies and RMSDs of decoys in a decoy set.
^b The present model; the orientational energies were calculated with l_p^max = l_e^max = k = 6, O_cutoff = 1792, β = 0.2, and c_cutoff = 0.025.
^c Reference 25.
^d Reference 24.
^e Reference 13; taken from Ref. 25.
^f Reference 33; the distance-dependent angular potential named 3C326.
^g Reference 18; generalized Born, Coulomb, nonpolar solvation, and van der Waals energy terms are included.

FOOTNOTES

^a Electronic mail: miyazawa@smlab.sci.gunma-u.ac.jp; URL: https://www.smlab.sci.gunma-u.ac.jp/~miyazawa/

^b Electronic mail: jernigan@iastate.edu; URL: http://ribosome.bb.iastate.edu/
