3DMSS-Sites: 3D Maximal Sub Structure search

What is 3DMSS-Sites?
3DMSS-Sites provides a means to search for common 3D substructures between structures. The search is general: 3DMSS-Sites searches for similar "clouds" of coordinates between sets of coordinates, i.e. it is independent of the order of the atoms in the structures. Atom or residue type compatibilities can be defined to avoid irrelevant pairings.
3DMSS-Sites relies on two different algorithms, Escan [1] and CSR [2], which have different properties.
A 3DMSS-Sites query can be a protein structure, a 3D motif, a collection of motifs, or the bank of non-redundant motifs extracted from the Catalytic Site Atlas. Thus, 3DMSS-Sites can search for occurrences of any known catalytic site geometrically compatible with a given protein structure.
As an alternative to this help page, consider reading the 3DMSS-Sites FAQ (frequently asked questions).

Quick access to 3DMSS-Sites search.

1. History
2. Concepts
3. Usage

Services, ancillary services
Input/output formats
Parameters

4. Examples, sample tests

History:

1998: Escan and CSR algorithms are published.
2004: Escan and CSR algorithms are implemented as separate online services.
2005: 3DMSS v1.0. Combines Escan and original CSR algorithm.
2006: 3DMSS-Sites v1.5. Enhanced CSR version. 3DMSS is coupled with the collection of Catalytic sites identified in the Catalytic Site Atlas.

Concepts:

In the 3DMSS-Sites service, two different algorithms can be employed. Both consider atomic coordinates without taking the order of the atoms into account:

CSR: This algorithm iteratively searches for the maximal common substructure between a collection of N atoms (the query) and a target structure (the target). In its original implementation (accessible as the CSR service), CSR performs a general search for similar "clouds" of coordinates between two sets of coordinates. The algorithm was revisited and rewritten in 2006 as CSR(re) in order to additionally take into account constraints on the types of the coordinates. CSR can be run against a collection of structures (the bank), one target after the other; each search against one particular target is independent. The number of atoms is limited to 50000 per protein. For structures of that size, runs may be particularly long and may require a huge number of iterations.
Note on the concept of "Maximal Sub Structure": CSR computes the maximal common 3D motif independently of any parameter. A maximal common 3D motif always exists. For convenience, the user may supply three parameters, namely the minimal number of atoms, the maximal RMS deviation and a tolerance (see the parameters section), so that the final motif is retained only when it satisfies the associated conditions. None of these three user parameters is part of the core of the original CSR algorithm. As a consequence, the superposition of the complete structures using the matches may contain atom pairs whose distance is lower than the tolerance but that CSR did not include in the motif: the match returned by CSR is maximal according to the CSR criterion, not necessarily in terms of the maximal number of pairs satisfying the tolerance criterion.

In the 3DMSS-Sites service (but not on the CSR page), CSR refers to the 2006 reimplementation of the original CSR algorithm, CSR(re), which can supplement the geometric search with filters on atom names and residue names (specified as regular expressions; more information is given below). This version also performs an a posteriori extension of the matches to identify additional pairs that satisfy the tolerance criterion: (i) CSR(re) searches for the maximal substructure among the possible pairings satisfying the atomic type constraints; (ii) it then post-processes the maximal substructure identified and attempts to enlarge it. CSR(re) presently returns only its best match, i.e. at most one match per pair of motif and bank molecule.
Escan: This algorithm considers both atomic coordinates and atom names or residue names to search for occurrences of substructures of a query in a collection of structures (the bank). It proceeds by building larger common substructures from smaller ones. It is possible to search using patterns for atom or residue names (more information is given below). In essence, it can be considered related to geometric hashing techniques (for more information on these, see for instance Gold and Jackson, Nucleic Acids Res, 2006, 34, D231-234 and references therein).
Algorithm choice: By combining CSR(re) and Escan, 3DMSS-Sites can be used in various contexts such as catalytic site search, side chain pattern search, the search for particular spatial arrangements of loops distant in the sequence, or even in the field of small compounds. In the 3DMSS-Sites service, the "auto" mode lets the data drive the choice of the algorithm (Escan or CSR). Nevertheless, the user can force the method by choosing the "Escan" or "CSR" mode. As a general rule, CSR can handle the search of large queries in a bank, whereas Escan is best suited to the identification of small queries in a large bank. In the 3DMSS-Sites service, both algorithms can consider atom and residue names. The individual original services CSR and Escan remain accessible from 3DMSS-Sites. In the CSR service, only atomic coordinates are considered.

More information about input/output formats: here. Specifications about multiPDB file format here
More information on parameters: here
Some examples here
References

Usage:

Access the search services:

1. Prepare your query.
(Not mandatory for 3DMSS-Sites)
Watch the format!
PDBpart can help!

2. Prepare your bank.
(Not mandatory for 3DMSS-Sites/CSR/Escan)
PDBPileUp can help!

3. Select an algorithm and run the search

3DMSS-Sites

Combine Escan and CSR(re) methods

Predefined motifs and bank of motifs of Catalytic Site Atlas

Coordinates plus atom/residue types

Access to former service versions remains temporarily possible here, although we discourage using these versions.

CSR
Coordinates only

CSR1 (older version)

Escan
Coordinates plus atom/residue types

Escan1 (older version)

Ancillary services:

PDBPileUp: This service makes it easy to create specific banks. It piles up PDB files into the multiPDB format accepted by the 3DMSS-Sites search algorithms.

PDBpart: This service helps to define a query. It applies a mask to the residues of a PDB file and returns only the selected residues.

PDBTM: This service applies the 3D transformation matrices returned as REMARKs in the 3DMSS-Sites result files.

OpenBabel: Interconversion of file formats.

Input/Output formats:

The input/output format is based on the PDB file format.

Input:

Query:

CSR query can be any PDB file containing ATOM/HETATM lines. Only the coordinates will be considered.
Escan query can be any PDB file containing ATOM/HETATM and REMARK lines. Some examples of queries can be found in the examples section.

       The Escan REMARK line format is as follows:
       REMARK     keyword value
             <---> 5 blanks (mandatory)

       REMARKs apply to the previous ATOM or HETATM lines.
       Possible keywords are:
       keyword=WEIGHT. (note the dot)
            WEIGHT is the weight assigned to the atom. By default (no weight specified) the weight is 1.
              Example:
               ATOM    371 OD1 ASP    51      -0.661 -1.665   1.498 0.000 0.000
               REMARK     WEIGHT. 2.00
       keyword=MATCH.
            MATCH is a UNIX regular expression specifying atom types matched by the coordinates:
                  . any symbol
                  * preceding expression repeated 0 times or more
                  + preceding expression repeated 1 time or more
                  | or
                  [] domain
               Example:
               ATOM    371 OD1 ASP    51      -0.661 -1.665   1.498 0.000 0.000
               REMARK     MATCH. O.*
                means that OD1 371 can match any oxygen. To specify a match against only OD or OE, one should use:
                REMARK     MATCH. OD|OE
                or
                REMARK     MATCH. O(D|E)

                If no match is specified, the ATOM lines only match atoms with the same name.
       keyword=RESMATCH.
                This is the counterpart of MATCH for residue names.
                Example:
               ATOM    371 OD1 ASP    51      -0.661 -1.665   1.498 0.000 0.000
               REMARK     RESMATCH. GLY|T.*
                implies OD1 371 can match any atom in GLY or any atom in a residue whose name starts with T (e.g. THR, TYR, TRP).
                If no RESMATCH is specified, the atom can match any residue type (i.e the default is equivalent to RESMATCH. .*)
       It is possible to combine MATCH and RESMATCH:
       Example:
               ATOM    371 OD1 ASP    51      -0.661 -1.665   1.498 0.000 0.000
               REMARK     MATCH. O.*
               REMARK     RESMATCH. ASP
       OD1 371 can match any atom with name starting with O in an ASP.
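The REMARK conventions described above can be read with a few lines of code. The following is only a minimal sketch, not the parser used by Escan: the function name parse_escan_query and the dictionary layout are hypothetical, atom names are taken from whitespace-separated fields (as the example lines on this page are written), and the sketch is more lenient than the service about the number of blanks after REMARK.

```python
def parse_escan_query(text):
    """Return one dict per ATOM/HETATM line, with its REMARK annotations.

    Hypothetical helper: defaults follow the rules stated on this page.
    """
    atoms = []
    for line in text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            fields = line.split()
            atoms.append({
                "line": line,
                "name": fields[2],   # assumption: whitespace-separated fields
                "weight": 1.0,       # documented default weight
                "match": None,       # None: only atoms with the same name match
                "resmatch": ".*",    # documented default: any residue type
            })
        elif line.startswith("REMARK") and atoms:
            # A REMARK applies to the previous ATOM or HETATM line.
            parts = line[6:].split(None, 1)
            if len(parts) != 2:
                continue
            keyword, value = parts[0], parts[1].strip()
            if keyword == "WEIGHT.":
                atoms[-1]["weight"] = float(value)
            elif keyword == "MATCH.":
                atoms[-1]["match"] = value
            elif keyword == "RESMATCH.":
                atoms[-1]["resmatch"] = value
    return atoms


# The example ATOM line of this page, used twice with different REMARKs.
query = """\
ATOM    371 OD1 ASP    51      -0.661 -1.665   1.498 0.000 0.000
REMARK     MATCH. O.*
REMARK     RESMATCH. ASP
ATOM    371 OD1 ASP    51      -0.661 -1.665   1.498 0.000 0.000
REMARK     WEIGHT. 2.00
"""
atoms = parse_escan_query(query)
```

Here the first record carries the MATCH and RESMATCH patterns with the default weight, and the second carries a weight of 2.0 with the default matching rules.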

       Note about PDB atom name for MATCH. and RESMATCH.

        PDB atom names are specified in columns 13 to 16, i.e. many atom
        names begin with a space character and all atom names are 4 characters
        long (for example, the alpha-carbon name is " CA " and the peptide nitrogen is " N  ").
        In Escan, these PDB atom names are converted into an
        internal representation before the matching rules are applied.
        You should know these conversion rules in order to write correct regular
        expressions for the atom name specifications.

        Hereafter "WXYZ" refers to the 4 characters code of atom name in PDB
        (for example, for alpha-carbon atoms, "WXYZ" is " CA ")

        If the first character, "W", is in the [A-Z] range then
                Escan matchname is ":WX"
        else if "Y" is in the [A-Z] range
                Escan matchname is "XY"
        else
                Escan matchname is "X"
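The conversion rule above can be written as a short function, which is handy for checking a MATCH. pattern before submitting a query. This is a sketch under assumptions: the names escan_match_name and name_allowed are hypothetical, and the pattern is assumed to have to match the whole internal name (re.fullmatch), which this page does not state explicitly.

```python
import re


def escan_match_name(pdb_name):
    """Convert a 4-character PDB atom name "WXYZ" into the Escan match name."""
    name = pdb_name.ljust(4)[:4]          # pad or trim to exactly 4 characters
    w, x, y = name[0], name[1], name[2]
    if "A" <= w <= "Z":
        return ":" + w + x
    if "A" <= y <= "Z":
        return x + y
    return x


def name_allowed(pattern, pdb_name):
    """True if the MATCH. pattern accepts the atom (assumption: full match)."""
    return re.fullmatch(pattern, escan_match_name(pdb_name)) is not None


print(escan_match_name(" CA "))          # alpha-carbon
print(escan_match_name("CA  "))          # calcium
print(name_allowed("OD|OE", " OD1"))
print(name_allowed("O.*", "CA  "))
```

Under this reading, a pattern meant to match a metal such as zinc has to include the leading colon of the internal name.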

      Examples:

        PDB name      ->     Escan internal name
        " CA "        ->     "CA"         (alpha-carbon)
        "ZN  "        ->     ":ZN"        (Zinc)
        "CA  "        ->     ":CA"        (Calcium)
        " CO "        ->     "CO"
        " HA1"        ->     "HA"
        " OD1"        ->     "OD"
        " O1D"        ->     "O"
        "C1  "        ->     ":C1"

Bank:

Both CSR and Escan accept multiPDB files as bank. A multiPDB file is a single file containing a series of PDB files. Each individual entry MUST start with a HEADER line and end with an END line. To preserve OpenBabel compatibility, COMPND is also accepted instead of HEADER. For CSR, it is however important that the bank contains homogeneous data (e.g. protein traces only or protein heavy atoms only, but not a mix of traces and full proteins with all heavy atoms).
The PDBPileUp utility can easily generate multiPDB files from PDB IDs, or pile up PDB files iteratively.
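To check a bank before submitting it, the multiPDB layout described above can be split into its entries as follows. This is a minimal sketch, not the reader used by the services: split_multipdb is a hypothetical name, and a COMPND line is treated as the start of an entry only when no entry is currently open, so that a COMPND record following a HEADER stays inside the same entry.

```python
def split_multipdb(text):
    """Split a multiPDB text into a list of entries (each a list of lines)."""
    entries, current = [], None
    for line in text.splitlines():
        record = line[:6].strip()
        if current is None:
            # Outside an entry: only HEADER or COMPND opens a new one.
            if record in ("HEADER", "COMPND"):
                current = [line]
        else:
            current.append(line)
            if record == "END":
                entries.append(current)
                current = None
    return entries


# Two tiny entries; the second title is made up for illustration.
sample = """\
HEADER    1A3B HIS/ASP/SER CATALYTIC SITE
ATOM   1800 OG SER H 195      13.458 -8.838 16.720 1.00 17.11           O
END
COMPND    SECOND ENTRY
ATOM    718 SG CYS A 97       3.734 16.785 -18.443 1.00 18.15           S
END
"""
entries = split_multipdb(sample)
print(len(entries))
```

An entry that is opened but never closed by an END line is silently dropped by this sketch, which is a convenient way to spot a malformed bank by comparing the number of entries with the number of HEADER or COMPND lines.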

Output:

3DMSS-Sites: a listing of the matches is given, together with a link to a result file, 3DMSS-Matches, containing all matches in a multiPDB formatted file. In this file, the coordinates of the query motif and of the bank match alternate. In addition, in the bank matches, REMARK MATRIX lines are inserted before the ATOM lines. They specify the transformation matrix that transforms the original coordinates in the bank to best superimpose them onto the query. This makes it possible to re-fit other parts of the bank entry, or the complete bank entry instead of only the matching atoms, using the same transformation. The PDBTM utility accepts the transformation matrix generated by 3DMSS-Sites. The user can paste either the matrix only, or the lines including the REMARK MATRIX strings.
Moreover, for each match in the output listing, a Jmol button is shown. It is a link to a Jmol visualization window of this motif/bank match. A link to a "joined file" with the motif and bank coordinates in a single PDB file ("joinNN.pdb") is also given, so that the file can be downloaded to visualize the match "at home". Note: in the join.pdb file, the query is given as chain A and the bank as chain B (this is necessary for the 3DMSS-Sites server because this file is also used for Jmol visualization and Jmol does not allow two molecules to be displayed). If Escan has been used in 3DMSS-Sites, a Score is given in addition to the RMS of the match; it is the number of matching atoms (weighted, see above). A "relative score to the query atoms" is also given, i.e. the weighted number of matching atoms divided by the total number of atoms of the query. If atoms are not weighted, this relative score can be seen as the proportion of query atoms that match a bank atom (if the relative score is 1.0, all atoms of the query have a match).
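The Score and the relative score described above amount to the following arithmetic. This is only a sketch following the wording of this page: the function name escan_scores is hypothetical, and the denominator is taken to be the number of query atoms (not the sum of their weights), as stated above.

```python
def escan_scores(query_weights, matched_indices):
    """Return (score, relative score) for a match.

    query_weights   : one weight per query atom (1.0 when not specified)
    matched_indices : indices of the query atoms that have a bank partner
    """
    score = sum(query_weights[i] for i in matched_indices)
    relative = score / len(query_weights)
    return score, relative


# A 10-atom unweighted query in which 8 atoms are matched.
score, relative = escan_scores([1.0] * 10, range(8))
print(score, relative)
```

With unweighted atoms the relative score is simply the fraction of query atoms matched; with weights above 1 it can exceed 1.0 under this definition.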
Escan and CSR: Two PDB files are returned, using the multiPDB file format (see above). They contain the query matches and the bank matches (best matches are ranked first).
CSR log file: Most of this file is intended for expert users only. However, hits are reported on lines such as:

KIT =      12       3       1 ; N =     4     4 ; RMS , SUP :     0.567706 0.759076
                    Here, N reports the number of atoms involved in a match found by CSR. The match may not meet the requirements of the search parameters.
                    Accepted matches are followed by a series of lines starting with:
MATCH:     8     0.655     0.914
                    where 8 is the number of atoms paired in the hit, and 0.655 and 0.914 are the RMSd and the tolerance.
                    For requests leading to no hit, the KIT lines may help to adjust the maximal number of atoms.
                    You may also consider increasing the number of iterations.
                    You may also carefully check any indications related to CUTOFF VALUE TOO SMALL or TOO LARGE.
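The KIT and MATCH lines shown above can be pulled out of a log file with two regular expressions. This is a sketch that assumes every such line looks like the two examples on this page (plain decimal numbers, same separators); parse_csr_log is a hypothetical name, and the fields whose meaning is not documented here are returned as raw values.

```python
import re

KIT_RE = re.compile(
    r"^\s*KIT\s*=\s*(\d+)\s+(\d+)\s+(\d+)\s*;"
    r"\s*N\s*=\s*(\d+)\s+(\d+)\s*;"
    r"\s*RMS\s*,\s*SUP\s*:\s*([\d.]+)\s+([\d.]+)"
)
MATCH_RE = re.compile(r"^\s*MATCH:\s+(\d+)\s+([\d.]+)\s+([\d.]+)")


def parse_csr_log(text):
    """Return (kits, matches) extracted from a CSR log."""
    kits, matches = [], []
    for line in text.splitlines():
        m = KIT_RE.match(line)
        if m:
            kits.append({
                "kit": tuple(int(v) for v in m.group(1, 2, 3)),  # raw values
                "n": tuple(int(v) for v in m.group(4, 5)),       # atoms in the match
                "rms": float(m.group(6)),
                "sup": float(m.group(7)),
            })
            continue
        m = MATCH_RE.match(line)
        if m:
            # (number of paired atoms, RMSd, tolerance)
            matches.append((int(m.group(1)), float(m.group(2)), float(m.group(3))))
    return kits, matches


log = """\
KIT =      12       3       1 ; N =     4     4 ; RMS , SUP :     0.567706 0.759076
MATCH:     8     0.655     0.914
"""
kits, matches = parse_csr_log(log)
print(kits[0]["n"], matches[0])
```

Listing the N values of the KIT lines for a request that returned no hit gives a quick view of how large the rejected matches were.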

Search Parameters

Note: the documentation for the individual Escan and CSR services is presently kept on this page, in order to provide some additional help in understanding the 3DMSS-Sites parameter choices.

3DMSS-Sites parameters:

Motif: this is the motif that will be searched. It can be either defined by the user (direct input, file upload) or selected from pre-defined frequent motifs such as the Catalytic sites, or some others. The format of the motif MUST follow the conventions defined in the Query:Input section.

Bank: this is where the motifs are searched. It can be any collection of PDB files.
Search parameters:

Search Method: you can select Escan or CSR, or let 3DMSS-Sites choose for you depending on the size of the motif searched.
Tolerance: The maximal distance between two paired atoms of a match (Angstroms). If unsure, leave a large value here.
Max RMSd: Maximum RMSd for a match (Angstroms). This is the maximal average deviation value over the paired atoms of the query and the match. Usual values are on the order of 2-3 Angstroms.
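To make the two quantities concrete, the sketch below computes the RMSd and the largest pair distance for a list of paired atoms. It assumes that the two coordinate sets are already superimposed (as the match coordinates returned by the service are) and that RMSd is the usual root mean square of the pair distances; pair_deviations is a hypothetical name.

```python
import math


def pair_deviations(query_xyz, match_xyz):
    """Return (RMSd, largest pair distance) for paired, superimposed atoms."""
    dists = [math.dist(q, m) for q, m in zip(query_xyz, match_xyz)]
    rmsd = math.sqrt(sum(d * d for d in dists) / len(dists))
    return rmsd, max(dists)


# First three pairs of the HIV-1 protease example of this page
# (query atoms of ASP A 25 in 1A30, matched atoms of ASP A 32 in 1B5F).
query_xyz = [(15.491, 27.364, 6.131), (15.413, 27.271, 4.884), (14.999, 26.534, 6.915)]
match_xyz = [(15.540, 27.419, 6.153), (15.684, 27.233, 4.927), (14.739, 26.748, 6.828)]
rmsd, largest = pair_deviations(query_xyz, match_xyz)
print(round(rmsd, 3), round(largest, 3))
```

A match is kept when the first value does not exceed Max RMSd and the second does not exceed the Tolerance.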

3DMSS-Sites-CSR and CSR parameters:

Note: for CSR, all parameters except the cutoff distance and those related to the number of iterations are used only a posteriori, to filter the solutions identified.
Without them, CSR would return a solution for each query against each member of the bank.

Distance Threshold / Cutoff distance: This is an internal parameter of CSR. It does not affect the result, but a correct value improves performance by optimising the use of internal space. An inadequate cutoff distance may require more memory than is internally reserved by the program, leading to premature termination. It may also increase the number of iterations CSR needs to find the matches. As a rule of thumb, this value should be close to a bond length (1.5-2. for small compounds, 0.9-1.2 for proteins, 4.-5. for alpha carbons). The log file may contain information related to the adjustment of this value.
Min # atoms: This is the minimal number of atom pairs required to accept a match. Matches involving fewer than this number of atom pairs will not be considered. A large value will invalidate this parameter. For 3DMSS-Sites CSR, a value of -1 means that only matches of the entire motif are sought, i.e. matches whose size equals the number of atoms of the motif. Smaller matches, even if identified, will not be returned.
Max RMSd: Maximum RMSd for a match (Angstroms). This is the maximal average deviation value over the paired atoms of the query and the match. Usual values are on the order of 2-3 Angstroms.
Tolerance: The maximal distance between two paired atoms of a match (Angstroms). If unsure, leave a large value here.
#Drawings / Max # iterations: A stop criterion: the maximal number of iterations of the search. Recommended value: about 200 for small molecules (less than 100 atoms), about 2000 for a hundred to a thousand atoms, and 20000 and more for larger molecules. Note: For 3DMSS-Sites-CSR, defining atomic/residue type compatibilities largely reduces the number of iterations required.
Convergence # iterations: A stop criterion: the maximal number of iterations allowed to find a better match. By default, CSR will perform up to the maximal number of iterations. This parameter allows the search to stop earlier. Most often, CSR will find solutions close to the optimum in fewer than 5000 iterations. However, it is not possible to guarantee that CSR would not find a better solution with a greater number of iterations. Recommended values are on the order of 100-200 for small compounds, 500-2000 for medium size requests (100 to 1000 atoms), 2000-2000 for larger molecules. Values larger than or equal to the maximal number of iterations have no effect.
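The a posteriori filtering described for Min # atoms, Max RMSd and Tolerance can be summarised as a single test applied to each solution. This is only a sketch of the rules stated above, not the code of the service: accept_csr_match is a hypothetical name, the thresholds in the example are illustrative, and the use of inclusive comparisons is an assumption.

```python
def accept_csr_match(n_pairs, rmsd, max_pair_distance, motif_size,
                     min_atoms, max_rmsd, tolerance):
    """Decide whether a solution found by CSR is reported as a match."""
    # For 3DMSS-Sites CSR, -1 means "the entire motif must be matched".
    required = motif_size if min_atoms == -1 else min_atoms
    return (n_pairs >= required
            and rmsd <= max_rmsd               # assumption: inclusive bound
            and max_pair_distance <= tolerance)


# A solution pairing 8 atoms of a 10-atom motif (illustrative thresholds).
print(accept_csr_match(8, 0.655, 0.914, motif_size=10,
                       min_atoms=8, max_rmsd=2.0, tolerance=3.0))
print(accept_csr_match(8, 0.655, 0.914, motif_size=10,
                       min_atoms=-1, max_rmsd=2.0, tolerance=3.0))
```

The second call shows why a partial match disappears from the results when Min # atoms is set to -1, even though CSR identified it.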

3DMSS-Sites-Escan and Escan parameters:

Initial radius / Initial search radius (Angstroms): The search is performed in consecutive steps: (i) an initial search for sub-parts of the query; (ii) the assembly of the regions matching sub-parts of the query to progressively identify larger matches. The initial search radius defines the granularity of the initial search. Values that are too large (larger than 2 inter-atomic distances in the query) correspond to a direct search for structures compatible with the query. This is strongly discouraged! Modify the reasonable default value of 10 Angstroms only if you are confident in what you are doing!
Min match size / Min # atoms: This is the minimal number of atom pairs required to accept a match. Matches involving fewer than this number of atom pairs will not be considered. A large value will invalidate this parameter.
Max # matches: Maximal number of matches returned, among all those found, sorted. The N best solutions are returned.
Maximal deviation: This is the maximal average deviation, in Angstroms, over the paired atoms of the query and the match. Only matches with RMS deviation values less than this value will be returned.
Tolerance: The maximal distance between two paired atoms of a match (Angstroms). Unlike for CSR, this parameter is used to shortcut the search.
Res match flag: On / off. This triggers the constraint on residue names for allowed pairings.
Max # matches per query motif: Unlike CSR, Escan can return more than one match per molecule searched. This limits the number of matches returned.

Examples

HIV protease (PR) is a dimer, unlike most other proteases of the aspartyl protease family, which are monomers. It contains fewer amino acid residues than bacterial or mammalian proteases and is the smallest of all retroviral proteases. In the 1A30 PDB structure, we took the CG, OD1, OD2 atoms of aspartate 25 and the CB, OG1 atoms of threonine 26 on each chain as a query for Escan (i.e. the query contains 10 atoms). Note that the clusters50 database consists of single chains of PDB structures; therefore a site at the interface between two chains (as in HIV-1 protease) cannot be found, and we can only find aspartyl proteases with one domain. When atom names are taken into account in the search (default behavior), 24 of the first 30 matches are aspartyl proteases or hydrolases (and 32 of the first 40 matches). Interestingly, when only the type of atom is taken into account, the first match is the 1B5F structure (native cardosin A), which is an aspartyl protease: the CB and OG1 atoms of the second threonine (green, in chain B) of 1A30 match the CB and OG atoms of serine 216 (yellow) of the cardosin (see figure).

HIV-1 protease catalytic site query:
(this motif is already defined in 3DMSS-Sites, but it can be copied/pasted in Escan/CSR input query field)

HEADER    COMPLEX (ASPARTIC PROTEASE/INHIBITOR)   27-JAN-98   1A30
TITLE     HIV-1 PROTEASE COMPLEXED WITH A TRIPEPTIDE INHIBITOR
ATOM    197 CG ASP A 25      15.491 27.364   6.131 1.00 16.91           C
ATOM    198 OD1 ASP A 25      15.413 27.271   4.884 1.00 18.26           O
ATOM    199 OD2 ASP A 25      14.999 26.534   6.915 1.00 19.87           O
ATOM    204 CB THR A 26      16.491 31.246   1.312 1.00 14.58           C
ATOM    205 OG1 THR A 26      16.917 30.009   0.732 1.00 14.02           O
ATOM    950 CG ASP B 25      15.325 25.224   1.561 1.00 14.53           C
ATOM    951 OD1 ASP B 25      15.806 25.931   2.467 1.00 16.12           O
ATOM    952 OD2 ASP B 25      14.535 24.296   1.789 1.00 21.36           O
ATOM    957 CB THR B 26      20.365 28.766   2.172 1.00 18.08           C
ATOM    958 OG1 THR B 26      19.571 29.418   3.168 1.00 18.28           O
END

Note: A bank containing only 5 aspartyl proteases, for testing with CSR the same catalytic site as used for the HIV-1 protease query, can be accessed here. (For CSR, you must set the minimal number of atoms to 8 and, to start with, leave the other parameters at their default values.)
Using CSR, which does not take atom types into account, good geometric matches are found, although they do not correspond to the site. Such a result is perfectly consistent with the goal of CSR. It illustrates the different scopes in which CSR and Escan should be used, as well as the extreme care required, when addressing a particular problem, to choose the correct algorithm and to prepare both query and bank in a relevant manner. The probability that some correct geometric match occurs between a query of small size and proteins having several thousand atoms is large. CSR is best suited for larger problems!

match in 1B5F:

1B5F
ATOM    248 CG ASP A 32      15.540 27.419   6.153 0.00 24.19
ATOM    249 OD1 ASP A 32      15.684 27.233   4.927 0.00 19.15
ATOM    250 OD2 ASP A 32      14.739 26.748   6.828 0.00 22.25
ATOM    255 CB THR A 33      16.193 31.216   1.412 0.00 15.94
ATOM    256 OG1 THR A 33      16.811 30.036   0.895 0.00 16.07
ATOM   1712 CG ASP A 215      15.375 25.247   1.499 0.00 19.27
ATOM   1713 OD1 ASP A 215      15.897 25.975   2.357 0.00 19.41
ATOM   1714 OD2 ASP A 215      14.602 24.312   1.764 0.00 18.85
ATOM   1719 CB SER A 216      20.398 28.679   2.094 0.00 21.56
ATOM   1720 OG SER A 216      19.674 29.195   3.203 0.00 23.20

We ran a test for Escan with atoms from the site of a serine protease (PDB structure 1a3b). The atoms chosen in the site were ND1, CD2, CE1, NE2 of histidine 57, CG, OD1, OD2 of aspartate 102 and N, O, OG of serine 195, so the query is composed of 10 atoms. When Escan is run in the default mode (i.e. atom names taken into account), 44 of the first 50 matches are serine proteinases, hydrolases, blood clotting enzymes and "unknown function" (which is in fact a serine protease). In the first 100 matches, 61 are of the above types. Note that there are 74 serine proteases in the clusters50 database. When the type of atom is specified, i.e. an oxygen atom matches any oxygen atom, a carbon atom matches any carbon atom, etc., 48 of the first 50 matches are of the above types (75 of the first 100 matches).

Serine protease catalytic site query:
(this motif is already defined in 3DMSS-Sites, but it can be copied/pasted in Escan input query field)

HEADER    1A3B HIS/ASP/SER CATALYTIC SITE
ATOM    551 ND1 HIS H 57       8.981 -8.152 15.830 1.00 17.35           N
ATOM    552 CD2 HIS H 57       9.053 -9.467 17.592 1.00 14.44           C
ATOM    553 CE1 HIS H 57      10.246 -8.415 16.098 1.00 20.68           C
ATOM    554 NE2 HIS H 57      10.307 -9.197 17.145 1.00 21.37           N
ATOM   1043 CG ASP H 102       7.273 -5.970 13.672 1.00 14.00           C
ATOM   1044 OD1 ASP H 102       6.614 -5.729 14.754 1.00 13.43           O
ATOM   1045 OD2 ASP H 102       8.266 -6.793 13.690 1.00 16.16           O
ATOM   1795 N   SER H 195      15.641 -7.104 17.788 1.00 14.34           N
ATOM   1798 O   SER H 195      14.090 -5.235 19.221 1.00 15.16           O
ATOM   1800 OG SER H 195      13.458 -8.838 16.720 1.00 17.11           O
END

An Escan test on a zinc fixation site (a tetrahedron of four cysteines) shows that all matches are zinc fixation sites. The query was composed only of the SG atoms of the 4 cysteines. All types of N, O and S atoms were allowed as matches for the four SG atoms. Interestingly, among the first 100 matches, 26 structures have at least one ND1 or NE2 atom of a histidine replacing an SG atom.

Zinc fixation site query:
(this motif is already defined in 3DMSS-Sites, but it can be copied/pasted in Escan input query field)

HEADER BINDING SITE OF ZN IN 1HSZ                            1hsz
ATOM    718 SG CYS A 97       3.734 16.785 -18.443 1.00 18.15           S
REMARK     MATCH. N.*|O.*|S.*
REMARK     RESMATCH. CYS|HIS|HOH
ATOM    737 SG CYS A 100       2.286 15.349 -15.180 1.00 18.51           S
REMARK     MATCH. N.*|O.*|S.*
REMARK     RESMATCH. CYS|HIS|HOH
ATOM    761 SG CYS A 103       5.608 14.289 -16.252 1.00 16.69           S
REMARK     MATCH. N.*|O.*|S.*
REMARK     RESMATCH. CYS|HIS|HOH
ATOM    826 SG CYS A 111       3.074 13.055 -18.482 1.00 21.91           S
REMARK     MATCH. N.*|O.*|S.*
REMARK     RESMATCH. CYS|HIS|HOH
END
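Queries such as the one above, where every atom carries the same MATCH. and RESMATCH. patterns, can be generated from a plain list of ATOM lines. This is a minimal sketch: annotate_query is a hypothetical helper, and it assumes the input does not already contain REMARK annotations. The REMARK keyword is followed by exactly 5 blanks, as required by the format.

```python
def annotate_query(lines, match=None, resmatch=None):
    """Insert the same MATCH./RESMATCH. REMARKs after each ATOM/HETATM line."""
    out = []
    for line in lines:
        out.append(line)
        if line.startswith(("ATOM", "HETATM")):
            if match is not None:
                out.append("REMARK" + " " * 5 + "MATCH. " + match)
            if resmatch is not None:
                out.append("REMARK" + " " * 5 + "RESMATCH. " + resmatch)
    return out


plain = [
    "HEADER BINDING SITE OF ZN IN 1HSZ                            1hsz",
    "ATOM    718 SG CYS A 97       3.734 16.785 -18.443 1.00 18.15           S",
    "END",
]
annotated = annotate_query(plain, match="N.*|O.*|S.*", resmatch="CYS|HIS|HOH")
print("\n".join(annotated))
```

Leaving match or resmatch at None keeps the documented default for that rule (same atom name, any residue type).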

CSR tests have been run on various families of proteins. For the pectate lyase family, we ran the 2PEC PDB entry (green, 352 alpha-carbons) against 2BSP (pink, 402 alpha-carbons), using alpha-carbon traces, for 5000 iterations. The cutoff distance was set to 5. This resulted in a match involving 102 alpha-carbons with an RMSd of 1.56 Angstroms. This match was found at iteration 102.

For two electron transfer iron-sulfur proteins (PDB codes 1ISU (yellow, 126 alpha-carbons) and 2HIP (blue, 144 alpha-carbons)), CSR found a match involving 48 pairs with an RMSd of 1.01 at iteration 800 (the cutoff distance was set to 5).

References
[1] Escan: Escalier, V., J. Pothier, H. Soldano and A. Viari. "Pairwise and multiple identification of three-dimensional common substructures in proteins." J. Computational Biology (1998) 5(1):41-56.
[2] CSR: M. Petitjean. "Interactive Maximal Common 3D Substructure Searching with the Combined SDM/RMS Algorithm." Comput. Chem. (1998) 22(6):463-465.