Understanding and predicting protein structures depends on the
complexity and the accuracy of the models used to represent them. An
emerging concept of the stuctural biology is the identification of a
Structural Alphabet i.e. of a limited number of recurrent structural
elements of proteins, structural "letters", whose associations
governed by logic rules form the words of protein structures. The
definition of such an alphabet poses numerous technical problems, but
provides a new tool for better understanding protein architecture. In
addition, the applications possible from such alphabet are numerous and
range from simplifying the protein backbone conformation with a correct
accuracy to more ambitious prediction approaches.
Hidden Markov Model *(HMM)* was used to discretize protein backbone
conformation as series of overlapping fragments (states) of 4-residue
length Camproux, 1999a.
Using *HMM* allows
moving the description of 3D structures from the sole geometric aspect
towards a more explanatory description embedding local dependence
information. This results in both a library of
representative structural fragments and their first order dependency,
called Structural Alphabet (*SA*). The question
of how many different representative fragments can optimally decompose
the whole set of short local conformations occurring in protein
structures can be adress using a statistical criterion. An optimal
systematic decomposition of the conformational variability of the
protein peptidic chain is obtained in 27 states together with strong
connection logic,stable over different protein sets. *SA*-*HMM* fits well the
previous knowledge related to protein architecture organisationand
seems able to grab some subtle details of protein organisation, such as
helix sub-level organisation schemes. Encoding protein 3D
coordinates into the fragments library or "structural alphabet space"
(*SA*) allows to discretize the 3D architecture ofproteins in terms of a
1-dimensional *SA*, noted 1D' *SA*. Reconstruction 3D of proteins
from 1D' *SA* space is validated on a set of proteinsand compared to
recent approaches. Use of *HMM* allows to drastically reduce the
complexity of the reconstruction. Although we use short fragments, the
learning process on entire protein conformations captures the logic of
the assembly on a larger scale. Using such a model, the structure of
proteins can be reconstructed with an average accuracy close to
1.1 Amgstroms of RMSd and for a complexity of only 3. Finally, we also observe
that sequence specificity increases with the number of states of the
structural alphabet. Such models can constitute a very relevant
approach to the analysis of protein architecture in particular for
proteinstructure prediction.

The extraction of *SA* is performed
from a collection of 1429 non-redundant proteins structures with
no disorder presenting less than 30% sequence identitywith a
resolution better than 2.5 Amgstrom. In order to keep a balance between the
largest number of proteins selected for learning and the
representativity of the dataset, we have used the non-redundant
set presenting less than 30% sequence identity. Only proteins at least
30 amino acids long, having no chain breaks, obtained by X-ray.
This resulted in a collection of 1429 proteins with no disorder or
missing coordinates.

The structures are described as consecutive overlapping fragments of 4-residues so that they can be fitted together to form the backbone geometry Camproux, 1999b. Each fragment is described by a 4-descriptors vector: the three distances between the non consecutive alpha_carbons : d

Suppose we knew that
polypeptidic chains are made up of representative fragments of R
different types. We would then assume that there are (R) states of the
model. Each state corresponding to a representative 4-residue fragment
of the polypeptidic chains is associated with a multinormal function.
Two models are considered: (i) a process without memory (order 0),
assuming independence of the R states and identified by training simple
finite Mixture Models *(MM)*; (ii) a process
with memory, i.e., a markovian model of order 1, where dependence
between contiguous fragmentsidentified by training a *HMM*,
see (Rabiner, 1989). Since *HMM* or *MM* give a
quantification of the likelihood of the data in the terms of a model,
it is possible to compare models using statistical criteria such as
Bayesian Information Criterion (BIC), (Schwartz, 1978).
Structural alphabets of different size (R), noted *SA*-R were learnt using
*HMM* by progressively increasing R and compared
using BIC

Combining*HMM* or *MM* with statistic criterion allows us to find
the optimal library fragments size from parsimony point of view.
The difference between the two types of models either *HMM*
(Markovian dependence between the states) or *MM* (independence), compare
on the basis of their goodness of fitis striking. For *MM*, no optimum is
reached until alphabet sizes of 70 whereas for *HMM*, an optimum is
reached for a number of states of 27 (*SA*-27) meaning a better fit of
the data using *HMM*.
Hence, learning prototype conformations considering not only the
geometry of the fragments but also the way the fragments are
concatenated (Markovian dependence) results in a simpler description,
in the sense that the optimal description of protein conformations can
be done using a much smaller number of prototype conformations. Similar
results (Figure 1) were obtained using two independent learning sets of
250 proteins (referred to as LS1 and LS2). The optimum is reached
for 27 states in both cases, and we find that the states identified
from LS1 and LS2 are very similar in terms of averages, variabilities,

frequencies and transitions.

Combining

frequencies and transitions.

Conformational variability of 4-residue fragments of proteins is
optimally described by a *SA* of 27 representative fragments, labelled
as [A,B,...,Z, a] Camproux, in revision. They are sorted by increasing stretches of their corresponding fragments and generated using XmMol (Tuffery, 1995).

The correspondence between the states and the usual secondary structures motifs (as identified using the STRIDE software, Frishman, 1995), is presented. Note that we have a much larger number of states than the secondary structure categories, and that we describe the conformations of fragments of 4-residue length (i.e. local shapes). Also, usual secondary structures are defined according to geometry criteria, and not the logic underlying the succession of the conformations. Finally, it is possible that for example single helix-like conformations can be observed outside helices. Hence we do not expect a strict correspondence here. However, we observe some clear relationships. Canonical alpha-helices are mostly associated with states [A, a, V, W] (the shortest states), but some positive over-representation of states [Z, B, C, D] is also observed. In contrast, 3-10 helices are mostly described by state [B], and to a lesser extent by states [E, H]. Pi-helices - only 0.7% of the dataset - seem to correspond to state [C]. Extended structures are dispatched into five states [L, N, M, T, X]. beta-turns, are associated with one or few particular states. Turns VIII (3.1%) correspond to states [I, P, G] and turn II (3.2%) mostly to states [U, S]. Turn I' and turn II' correspond both to fuzziest state [F] and to its closest state [U]. Turns VIa and VIb, occurring few, correspond to state [F]. The remaining Turn I, Turn IV, as well as the Coils, correspond to more complex profiles of states, clearly different however. Turn I (8.3%) is associated with several states [Z, B, C, D, E, O, S, Q, P, H]. Turn IV (8.2%), a geometrically not well determined turn, correspond to states [R, Q, I, F, U, G], among which states [F, R] are the geometrically most variable states. The remaining coils (17.9%) are mostly described by states [G, Y, J, K, L] corresponding to rather extended conformations and by states [D, S, Q].

Taking advantage of this local dependence reduces both the total number of representative fragments and number of possible representative fragments at each position (average of 8 states).

Subsequently,

This approach can be used for
analysing strings of conformational states that define protein
structure in the same way that is done for strings of amino acid
residues that define protein sequence. For instance to generate
accurate loop conformations in homology modelling (Camproux, 2001), for homology search
and for prediction ab initio from the discretize conformational space.