The short-range potential is a secondary structure potential based on peptide dihedral angles. All of these potentials are estimated as potentials of mean force from the observed distributions of residue-residue contacts and of peptide dihedral angles at the residue level in crystal structures of proteins. In the following, energy is represented in $k_B T$ units, where $k_B$ is the Boltzmann constant and $T$ is temperature.
where $e^{c}(r_i, r_j)$ is the contact energy between the ith and jth residues, and $r_i$ represents all the atomic positions of the ith residue. The pairwise energy potential is represented as the sum of two terms: the usual contact potential (Refs. 2-4) and a potential of mean force for the relative orientations between contacting residues, which is evaluated here from the statistical distribution of relative orientations,
where $c(r_i, r_j)$ represents the degree of contact between the ith and jth residues, $e^{c}_{a_i a_j}$ is the contact energy for residues of types $a_i$ and $a_j$ in contact, and $e^{o}_{a_i a_j}(r_i, r_j)$ is the orientational energy for the relative direction and rotation between amino acids of types $a_i$ and $a_j$ in contact; $a_i$ denotes the amino acid type of the ith residue. Note that the radial distance between residues is described only by specifying whether or not the residues are in contact with each other, and that orientational interactions are assumed only for residues that are in contact with each other.
$c(r_i, r_j)$ takes the value one for residues that are completely in contact, the value zero for residues that are too far from each other, and values between zero and one for residues whose distance is intermediate between those two extremes, about 6.5 Å between the geometric centers of their side chain heavy atoms. Previously, this function was defined as a step function for simplicity. Here, it is defined as a switching function as follows; in the equation below that defines residue contacts, $r_i$ means the position vector of the geometric center of the side chain heavy atoms, or of the Cα atom for GLY,
where Sw is a switching function, and r is the van der Waals radius of a residue of type a, which is estimated from the average volume $V_a$ occupied by a residue of type a in protein structures, assuming the packing density of hard spheres; the values of $V_a$ are those calculated in Refs. 46 and 47 and listed in Ref. 2. The critical distance that defines a residue-residue contact is about 6.5 Å, but it is taken to be larger for bulky residues.
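The idea of a graded contact can be illustrated with a short Python sketch. It is not the paper's Eq. (7): the cubic smoothstep form, the `width` of the transition region, and the function names `switching`, `contact_degree`, and `pair_energy` are all assumptions made for illustration. The way the two energy terms are combined (orientational energy counted only in proportion to the degree of contact) follows the verbal description above, not a formula reproduced from the source.

```python
import math


def switching(x: float, width: float = 0.5) -> float:
    """Smooth step: 1 for x <= -width, 0 for x >= width.

    The cubic smoothstep used here is an assumed form, not the paper's Sw.
    """
    if x <= -width:
        return 1.0
    if x >= width:
        return 0.0
    t = (x + width) / (2.0 * width)  # maps [-width, width] onto [0, 1]
    return 1.0 - t * t * (3.0 - 2.0 * t)


def contact_degree(center_i, center_j, d_contact: float = 6.5, width: float = 0.5) -> float:
    """Degree of contact between two side chain centers (coordinates in angstroms).

    d_contact is the critical distance; about 6.5 angstroms in the text,
    and it would be set larger for bulky residues.
    """
    distance = math.dist(center_i, center_j)
    return switching(distance - d_contact, width)


def pair_energy(c: float, e_contact: float, e_orient: float) -> float:
    """Pairwise energy in k_B T units: both terms scaled by the degree of contact (assumed form)."""
    return c * (e_contact + e_orient)


if __name__ == "__main__":
    c = contact_degree((0.0, 0.0, 0.0), (6.5, 0.0, 0.0))
    print(c, pair_energy(c, e_contact=-3.0, e_orient=1.0))
```

Replacing the step function with a continuous switch of this kind makes the energy vary smoothly with the distance between residues near the critical distance.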
and a residue-type dependent term e; r means an average residue here.
The energies e for all pairs of the 20 types of residues were recalculated (Ref. 44) from 2129 protein species representatives of SCOP (Ref. 48) Release 1.53 with the sampling method (Ref. 3) and with the parameters evaluated in Miyazawa and Jernigan (Ref. 4) to correct these values estimated by the Bethe approximation; actually, the estimates of contact energies corrected for the Bethe approximation are divided by 0.263, defined in Eq. (34) of that paper (Ref. 4), and used as the values of e. In other words, the intrinsic pairwise interaction energies $e_{ij}$ are corrected relative to the hydrophobic energies $e_{ir}$, and the hydrophobic energies are not corrected at all; see that paper (Ref. 4) for the exact definitions of $e_{ij}$ and $e_{ir}$. This scheme is employed so that all the energy potentials in Eq. (1) have magnitudes estimated as the potential of mean force from observed distributions by assuming a Boltzmann distribution.
The collapse energy $e_{rr}$ is essential for a protein to fold, because it cancels out the large conformational entropy of extended conformations, but it is difficult to estimate (Refs. 2 and 3). The value -2.55 originally estimated (Refs. 2 and 3) for $e_{rr}$ is used here; as a result, the contact energy takes a negative value for all amino acid pairs except for the LYS-LYS pair.
Relative orientations between contacting residues are specified here by polar angles $(\theta, \phi)$ and Euler angles $(\Theta, \Phi, \Psi)$, which describe the direction and rotation of one residue relative to another, respectively. A local coordinate system fixed on each residue will be defined later. The potential of mean force for residue orientations is defined as
where $f_{aa'}(\theta, \phi, \Theta, \Phi, \Psi)$ is a probability density function for a residue of type $a'$ at the orientation $(\theta, \phi, \Theta, \Phi, \Psi)$ relative to the residue of type $a$; it satisfies
and $f_{a'a}$:
The relationship with respect to the polar angles $(\theta, \phi)$ is not simple, but the polar angles for the reverse direction can be uniquely calculated from $(\theta, \phi, \Theta, \Phi, \Psi)$. Thus, in principle, $f_{aa'}$ and $f_{a'a}$ must be equal to each other:
However, in the present statistical estimation of the probability density, the relationship above is satisfied only approximately. Therefore, the potential is evaluated in the form of Eq. (11).
Here it is important to note that this term represents a reference state such that the expected value of the orientational energy for each type of contacting residue pair in the native structures is equal to zero. Thus, this orientational potential represents only the suitability of a relative orientation between contacting residues; it does not represent at all whether a contact between residues is favorable or not. The latter is supposed to be represented in the present scheme by the usual contact energy. The reference distribution of residue-residue orientations for these orientational potentials is the uniform distribution, and not the overall distribution for all types of amino acid pairs employed by others (Refs. 33-35). Therefore, for residue pairs whose distributions coincide with the overall distribution, the latter potentials never give any preference, whereas the present potentials do. This is a desirable behavior for orientational potentials, because such an overall distribution of residue-residue orientations would not be an intrinsic characteristic of non-native conformations but rather of native structures of proteins.
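The difference between the two reference distributions can be made concrete with a small numerical sketch. The three orientation bins and their probabilities below are invented for illustration only, and `pmf` is a hypothetical helper; the real potentials are defined over the continuous five-angle space, not over bins.

```python
import math


def pmf(p, ref):
    """Potential of mean force per bin, -ln(p / ref), in k_B T units."""
    return [-math.log(pk / rk) for pk, rk in zip(p, ref)]


# Invented overall distribution over three orientation bins (all pair types pooled).
overall = [0.5, 0.3, 0.2]
uniform = [1.0 / 3.0] * 3

# A residue pair whose own distribution happens to coincide with the overall one.
pair = list(overall)

e_overall_ref = pmf(pair, overall)  # zero in every bin: no preference at all
e_uniform_ref = pmf(pair, uniform)  # non-zero: the populated bin is favored

if __name__ == "__main__":
    print(e_overall_ref)
    print(e_uniform_ref)
```

With the overall distribution as reference, this pair receives zero energy in every bin, while the uniform reference still favors the most populated orientation, which is the behavior described above.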
$(\theta, \phi, \Theta, \Phi, \Psi)$ variables.
g is represented as
where $Y_{l_p}^{m_p}$ is the normalized spherical harmonic function and $P_{l_p}^{m_p}$ is the associated Legendre function; $P_{l_p}^{m_p}$ with $m_p = 0$ is the Legendre polynomial. Then, the coefficients in the expansion of Eq. (18) can be calculated from the observed density distribution by
Thus, the coefficient of the first constant term in Eq. (18) that corresponds to the uniform distribution is obvious;
For the observed distribution, the $\delta$ function can be used (Ref. 33), that is,
and then, the expansion coefficients are calculated as
where $(\theta_\mu, \phi_\mu, \Theta_\mu, \Phi_\mu, \Psi_\mu)$ is the set of angles observed for the contact $\mu$ between residue types $a$ and $a'$, and $w_\mu$ is a weight for this contact. The summations in the equations above are over all contacts of amino acid types $a$ versus $a'$. A contact between amino acid types $a$ and $a'$ is counted as one half of a contact for $a$ versus $a'$ and another half for $a'$ versus $a$; $N_{aa'} + N_{a'a}$ is equal to the actual number of contacts between amino acid types $a$ and $a'$. Thus, a weight $w_\mu$ is equal to $0.5 w_c$, where $w_c$ is a sampling weight for each protein that is described in the section "Datasets of protein structures used." In Eq. (24), residues are regarded to be in contact if the geometric centers of side chains, or Cα atoms for GLY, are within 6.5 Å.
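How expansion coefficients follow from a weighted sum over observed contacts can be sketched in Python for a reduced case. The sketch is restricted to the polar-angle part and to spherical harmonics with l up to 1; the Euler-angle part of the basis functions is not reproduced here. The normalization by the total weight and the names `y_lm`, `TERMS`, and `expansion_coefficients` are assumptions for illustration, not the paper's definitions.

```python
import cmath
import math


def y_lm(l: int, m: int, theta: float, phi: float) -> complex:
    """Normalized spherical harmonics for l <= 1 (Condon-Shortley phase)."""
    if (l, m) == (0, 0):
        return complex(0.5 / math.sqrt(math.pi))
    if (l, m) == (1, 0):
        return complex(math.sqrt(3.0 / (4.0 * math.pi)) * math.cos(theta))
    if l == 1 and m in (-1, 1):
        amplitude = math.sqrt(3.0 / (8.0 * math.pi)) * math.sin(theta)
        return -m * amplitude * cmath.exp(1j * m * phi)
    raise ValueError("only l <= 1 is implemented in this sketch")


TERMS = ((0, 0), (1, -1), (1, 0), (1, 1))


def expansion_coefficients(contacts):
    """contacts: iterable of (theta, phi, w_c), with w_c the per-protein sampling weight.

    Each contact enters with weight 0.5 * w_c, because it is counted half for
    a versus a' and half for a' versus a.
    """
    contacts = list(contacts)
    weights = [0.5 * w_c for _, _, w_c in contacts]
    total = sum(weights)  # assumed normalization: the weighted number of contacts
    coeffs = {}
    for l, m in TERMS:
        acc = sum(
            w * y_lm(l, m, theta, phi).conjugate()
            for w, (theta, phi, _) in zip(weights, contacts)
        )
        coeffs[(l, m)] = acc / total
    return coeffs


if __name__ == "__main__":
    observed = [(0.0, 0.0, 1.0), (0.0, 1.0, 0.5)]  # invented angles in radians
    print(expansion_coefficients(observed))
```

In this reduced setting the coefficient of the constant term equals 1/(2√π) whatever the data, which mirrors the remark that the coefficient corresponding to the uniform distribution is obvious.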
where is taken to be
in order to reduce statistical errors resulting from the small size of samples; in Eq. (31) is a parameter to be optimized. Equation (31) means that more samples are required to determine higher frequency modes. In Eq. (27), the first term dominates the second term in the limit of small $N_{aa'}$, and conversely the second term dominates the first term in the limit of large $N_{aa'}$.
and
where Ocutoff is a cutoff value for expansion terms.
where H is the Heaviside step function, which takes the value one for zero and positive values of the argument and is otherwise zero. Finally, the estimate of the probability density $f_{aa'}(\theta, \phi, \Theta, \Phi, \Psi)$ is cut off at sufficiently low and high values in such a way that its logarithm takes a value within an appropriate range; for example, $-7 \le -\ln f_{aa'}(\theta, \phi, \Theta, \Phi, \Psi) + \ln(c\, g_{00000}) \le 1$.
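The cutoff on the estimated density can be sketched as a clamp on the resulting energy. The function name `clamped_energy` and the argument `ref` (standing for the constant reference term inside the second logarithm) are hypothetical. Assigning the upper bound to non-positive density estimates is an assumption of this sketch, motivated by the fact that a truncated series expansion can dip to or below zero.

```python
import math


def clamped_energy(f: float, ref: float, lo: float = -7.0, hi: float = 1.0) -> float:
    """Return -ln(f) + ln(ref), restricted to the range [lo, hi] (k_B T units).

    f   : estimated probability density at some orientation
    ref : constant reference term (hypothetical argument name)
    """
    if f <= 0.0:
        # Assumed handling: a non-positive estimate is treated as maximally unfavorable.
        return hi
    energy = -math.log(f) + math.log(ref)
    return min(hi, max(lo, energy))


if __name__ == "__main__":
    for density in (1e-9, 0.5, 2.0, 1e9, -0.1):
        print(density, clamped_energy(density, ref=1.0))
```

The bounds -7 and 1 are the example values quoted in the text; other ranges could be passed through `lo` and `hi`.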
, and a repulsive packing potential e
,
where Sw is defined by Eq. (7). The repulsive packing potentials for the 20 types of residues are estimated from the observed distributions of the numbers of contacting residues in dense regions of protein structures by assuming a Boltzmann distribution (Ref. 3). $N(a_i, n)$ is the observed number of residues of type $a_i$ that are surrounded by $n$ residues in the database of protein structures. $q_{a_i}$ is a coordination number, which is defined as the maximum feasible number of contacting residues around a residue, for the amino acid of type $a_i$. The constant in Eq. (40) is a small value ($10^{-6}$) that is added to avoid the divergence of the logarithm function. The observed distribution $N(a_i, n)$ used here is the one (Ref. 44) compiled from 2129 protein species representatives of SCOP (Ref. 48) Release 1.53 with our sampling method (Ref. 3).
$(\phi_i, \psi_i)$ over all residues:
For this secondary structure potential, a 10° mesh over $(\phi, \psi)$ space is used to count the frequencies of amino acids observed in protein native structures, and this intraresidue potential for each amino acid type $a$ is evaluated as
where $N_a(\phi, \psi)$ is the number of amino acids of type $a$ at $(\phi, \psi)$ observed in protein native structures, and $N_a$ is their sum over the entire $(\phi, \psi)$ space, that is, the number of amino acids of type $a$. The second term is a constant that corresponds to a reference energy, chosen so that the $(\phi, \psi)$ energy expected for each type of residue in the native structures is equal to zero.
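The construction described here (counts on a 10° mesh, a negative logarithm of the observed frequency, and a constant shift that makes the expected energy zero) can be sketched as follows. Sampling weights are omitted, bins with no observations are simply left out of the result, and the names `bin_index` and `phi_psi_potential` are hypothetical; these simplifications are assumptions of the sketch.

```python
import math
from collections import Counter

BIN_DEGREES = 10
N_BINS = 360 // BIN_DEGREES


def bin_index(phi: float, psi: float):
    """Map dihedral angles in degrees, in [-180, 180), to a cell of the 10-degree mesh."""
    i = int((phi + 180.0) // BIN_DEGREES) % N_BINS
    j = int((psi + 180.0) // BIN_DEGREES) % N_BINS
    return i, j


def phi_psi_potential(angles):
    """angles: iterable of (phi, psi) in degrees for one amino acid type.

    Returns {bin: energy in k_B T units}, shifted so that the energy
    averaged over the observed distribution is zero.
    """
    counts = Counter(bin_index(phi, psi) for phi, psi in angles)
    n_total = sum(counts.values())
    raw = {b: -math.log(n / n_total) for b, n in counts.items()}
    reference = sum((counts[b] / n_total) * raw[b] for b in counts)
    return {b: e - reference for b, e in raw.items()}


if __name__ == "__main__":
    # Invented sample: three helix-like residues and one strand-like residue.
    sample = [(-60.0, -45.0)] * 3 + [(-120.0, 130.0)]
    print(phi_psi_potential(sample))
```

In this toy sample the more populated cell receives the lower energy, and the frequency-weighted mean of the energies is zero by construction.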
The observed distribution $N_a(\phi, \psi)$ used here is the one (Ref. 44) compiled from 2129 protein species representatives of SCOP (Ref. 48) Release 1.53 with the sampling method (Ref. 3) used to reduce the weights of contributions of structures having high sequence identity.
all α, all β, α/β, α+β, and multidomain proteins. Classes of membrane and cell surface proteins, small proteins, peptides, and designed proteins are not used. Proteins whose structures (Ref. 50) were determined by NMR, or whose stated resolutions are worse than 2.5 Å, are removed to ensure that the quality of the structures used is high. Proteins whose coordinate sets consist of only Cα atoms, include many unknown residues, or lack many atoms or residues are also removed, as are proteins shorter than 50 residues. As a result, the set of species representatives includes 4435 protein domains; this dataset is named here dataset A.
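The selection criteria above amount to a simple filter over candidate domains, sketched below. The record type `DomainEntry`, its field names, the class labels in `ALLOWED_CLASSES`, and the example entries are all hypothetical. The text gives no numerical threshold for "many" unknown residues or missing atoms, so that criterion is represented by a precomputed boolean flag.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels for the five structural classes that are kept.
ALLOWED_CLASSES = {"all-alpha", "all-beta", "alpha/beta", "alpha+beta", "multi-domain"}


@dataclass
class DomainEntry:
    name: str
    scop_class: str
    method: str                   # e.g. "X-RAY" or "NMR"
    resolution: Optional[float]   # angstroms; None if not stated
    length: int                   # number of residues
    incomplete: bool = False      # CA-only, many unknown residues, or many missing atoms


def keep(entry: DomainEntry, max_resolution: float = 2.5, min_length: int = 50) -> bool:
    """Apply the selection criteria described in the text to one entry."""
    if entry.scop_class not in ALLOWED_CLASSES:
        return False
    if entry.method.upper() == "NMR":
        return False
    if entry.resolution is None or entry.resolution > max_resolution:
        return False
    if entry.length < min_length:
        return False
    return not entry.incomplete


if __name__ == "__main__":
    candidates = [
        DomainEntry("dom1", "all-alpha", "X-RAY", 1.8, 120),
        DomainEntry("dom2", "membrane", "X-RAY", 2.0, 300),
        DomainEntry("dom3", "alpha/beta", "NMR", None, 90),
        DomainEntry("dom4", "all-beta", "X-RAY", 2.8, 150),
        DomainEntry("dom5", "alpha+beta", "X-RAY", 2.5, 49),
    ]
    print([e.name for e in candidates if keep(e)])
```

Treating an entry without a stated resolution as rejected is a conservative choice made for this sketch.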