SA-HMM

Understanding and predicting protein structures depends on the complexity and the accuracy of the models used to represent them. An emerging concept of the stuctural biology is the identification of a Structural Alphabet i.e. of a limited number of recurrent structural elements of proteins, structural "letters", whose associations governed by logic rules form the words of protein structures. The definition of such an alphabet poses numerous technical problems, but provides a new tool for better understanding protein architecture. In addition, the applications possible from such alphabet are numerous and range from simplifying the protein backbone conformation with a correct accuracy to more ambitious prediction approaches. Hidden Markov Model (HMM) was used to discretize protein backbone conformation as series of overlapping fragments (states) of 4-residue length Camproux, 1999a. Using HMM allows moving the description of 3D structures from the sole geometric aspect towards a more explanatory description embedding local dependence information. This results in both a library of representative structural fragments and their first order dependency, called Structural Alphabet (SA). The question of how many different representative fragments can optimally decompose the whole set of short local conformations occurring in protein structures can be adress using a statistical criterion. An optimal systematic decomposition of the conformational variability of the protein peptidic chain is obtained in 27 states together with strong connection logic,stable over different protein sets. SA-HMM fits well the previous knowledge related to protein architecture organisationand seems able to grab some subtle details of protein organisation, such as helix sub-level organisation schemes. Encoding protein 3D coordinates into the fragments library or "structural alphabet space" (SA) allows to discretize the 3D architecture ofproteins in terms of a 1-dimensional SA, noted 1D' SA. Reconstruction 3D of proteins from 1D' SA space is validated on a set of proteinsand compared to recent approaches. Use of HMM allows to drastically reduce the complexity of the reconstruction. Although we use short fragments, the learning process on entire protein conformations captures the logic of the assembly on a larger scale. Using such a model, the structure of proteins can be reconstructed with an average accuracy close to 1.1 Amgstroms of RMSd and for a complexity of only 3. Finally, we also observe that sequence specificity increases with the number of states of the structural alphabet. Such models can constitute a very relevant approach to the analysis of protein architecture in particular for proteinstructure prediction.

Materials and Methods

Material

Describing three dimensional conformations

The structures are described as consecutive overlapping fragments of 4-residues so that they can be fitted together to form the backbone geometry Camproux, 1999b. Each fragment is described by a 4-descriptors vector: the three distances between the non consecutive alpha_carbons : d₁=d{C_alpha1-C_alpha3}, d₂=d{C_alpha1-C_alpha4}, d₃=d{C_alpha3-C_alpha4} where 1,..,4 denotes the 4 residues of a fragment, and the oriented projection P_4 of the last alpha-carbon C_alpha₄(h) to the plane formed by the three first ones. Two independent learning subsets of 250 proteins representing respectively a total of 56167 four-residues fragments (LS1) and 57544 four-residues fragments (LS2) were randomly selected.

Results

Identification of the optimal structural alphabet (SA)

Suppose we knew that polypeptidic chains are made up of representative fragments of R different types. We would then assume that there are (R) states of the model. Each state corresponding to a representative 4-residue fragment of the polypeptidic chains is associated with a multinormal function. Two models are considered: (i) a process without memory (order 0), assuming independence of the R states and identified by training simple finite Mixture Models (MM); (ii) a process with memory, i.e., a markovian model of order 1, where dependence between contiguous fragmentsidentified by training a HMM, see (Rabiner, 1989). Since HMM or MM give a quantification of the likelihood of the data in the terms of a model, it is possible to compare models using statistical criteria such as Bayesian Information Criterion (BIC), (Schwartz, 1978). Structural alphabets of different size (R), noted SA-R were learnt using HMM by progressively increasing R and compared using BIC
Combining HMM or MM with statistic criterion allows us to find the optimal library fragments size from parsimony point of view. The difference between the two types of models either HMM (Markovian dependence between the states) or MM (independence), compare on the basis of their goodness of fitis striking. For MM, no optimum is reached until alphabet sizes of 70 whereas for HMM, an optimum is reached for a number of states of 27 (SA-27) meaning a better fit of the data using HMM. Hence, learning prototype conformations considering not only the geometry of the fragments but also the way the fragments are concatenated (Markovian dependence) results in a simpler description, in the sense that the optimal description of protein conformations can be done using a much smaller number of prototype conformations. Similar results (Figure 1) were obtained using two independent learning sets of 250 proteins (referred to as LS1 and LS2). The optimum is reached for 27 states in both cases, and we find that the states identified from LS1 and LS2 are very similar in terms of averages, variabilities,
frequencies and transitions.

Optimal Structural Alphabet

Geometrical description of the SA

Conformational variability of 4-residue fragments of proteins is optimally described by a SA of 27 representative fragments, labelled as [A,B,...,Z, a] Camproux, in revision. They are sorted by increasing stretches of their corresponding fragments and generated using XmMol (Tuffery, 1995).

Logical rules within Structural Alphabet

Taking advantage of this local dependence reduces both the total number of representative fragments and number of possible representative fragments at each position (average of 8 states).

Structures encoding in SA space

Encoded 3D conformation proteins in the 1D' SA space corresponds to a compression of the 3D-protein space in the discrete SA space (1D'). aking into account dependence between fragments allows to perform this encoding globally and non position by position and to reduces the complexity of the reconstruction. Indeed, Viterbi algorithm (Baum, 1970; Rabiner, 1989) based on dynamic programming algorithm, finds the single best fragments series amongst all possible paths. Reconstruction of 3D conformation of proteins from 1D' SA space corresponds to the decompression step. Quality of the reconstruction step is evaluated on a sample of 200 proteins and shows very performent results considering very low complexity, compared to to recent approaches Micheletti (2000), Kolodny (2002). Using such a model, the structure of proteins can be reconstructed with an average accuracy close to 1.1 Amgstroms of RMSd and for a complexity of only 3.
Subsequently, SA provides some kind of compression from the 3D protein coordinate space into a 1D structural alphabet space. From such 1D encoding and the associated logical rules, it seems possible to tackle the exploration of three-dimensional protein conformations using 1D techniques. An facility coming with the use of HMM is that it is possible to quantify, during the SA encoding of proteins of known structures, the probability of substituting one local conformation by another. This allows to quantify the similarity of protein fragments encoded as different series of states. This widens the perspective of being able to work with a 1D representation of 3D structures much beyond the simple search of exact words, through the use of the classical 1D amino acid alignment methods.

Conformational learning correlates with sequence specificity

We have assessed the degree of specificity of the sequences associated with each state of SA-27.All the states have some significant amino acid sequence specificity compared to the profiles of the collection of 1429 protein fragments.Using a Kullback-Leibler asymmetric divergence measure, the weakest dependence, observed for [E], is still very significant (p<0.001). This implies that the distributions of the 4-uplets of amino acids associated with each state become more and more specific as the number of states increases. This confirms that, despite we have learnt SA using only geometric information, we have not over-split sequence information.

Conclusion

This approach can be used for analysing strings of conformational states that define protein structure in the same way that is done for strings of amino acid residues that define protein sequence. For instance to generate accurate loop conformations in homology modelling (Camproux, 2001), for homology search and for prediction ab initio from the discretize conformational space.

Hidden Markov Model Structural alphabet (SA-HMM) for the exploration of protein structures

Introduction