BCSearch
Services to mine collections of protein structures to identify fragments similar to a query.
The growing number of experimentally solved 3D structures makes tools to explore their diversity timely. Whereas numerous tools have been developed for the structural alignment of complete protein structures or complete domains, tools that explore conformations at a more local level (that of small fragments) face two limits: the huge number of comparisons to perform, and the relevance of the similarity criteria, since most similarity scores are relevant only for large enough fragments (see [1]). This hampers efforts to better understand the local sequence-structure relationship, which is of interest in various contexts such as protein design (mining collections of structures to retrieve sequences compatible with a local shape), protein modeling (identifying alternative conformations that could match boundary conditions), or, more generally, assessing how frequently a particular local conformation occurs in the available structures.
BCSearch is a fast and flexible approach to identify linear fragments similar to a query in large collections of structures. It addresses two basic questions: (i) among a subset of structures, which sequences are compatible with the conformation of my query, and (ii) are conformations similar to my query observed in other proteins? BCSearch relies on a similarity measure based on a Binet-Cauchy (BC) kernel [1]. The approach measures the correlation between the volumes of all the tetrahedra of the query and those of a target. The similarity (BCscore) ranges from -1 to 1, where a value of 1 corresponds to exactly the same conformation as the query, and -1 to the mirror conformation. Values close to 0 correspond to unrelated fragments. The BCscore is more stringent than other criteria such as the alpha carbon RMS deviation. In particular, fragments with partly dissimilar shapes are poorly scored, so collections of matches are usually less noisy, which makes them better suited to the analysis of the local structure-sequence relationship. In addition, since no superimposition is required, the similarity search is very fast, making it possible to mine large collections of structures.
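As an illustration of the scoring idea, the correlation between signed tetrahedron volumes can be written in closed form: by the Cauchy-Binet identity, the sum over all residue triplets of the products of signed volumes equals det(XᵀY), where X and Y hold the centered alpha carbon coordinates of the query and the target. The sketch below is a minimal pure-Python version under that assumption. The function names and the toy coordinates are ours, and this is not the BCSearch implementation (see [1] for the actual definition).

```python
import math

def _centered(coords):
    """Translate coordinates so that their center of mass is at the origin."""
    n = len(coords)
    center = [sum(p[i] for p in coords) / n for i in range(3)]
    return [tuple(p[i] - center[i] for i in range(3)) for p in coords]

def _det_xty(x, y):
    """Determinant of the 3x3 matrix X^T Y for two lists of 3D points."""
    m = [[sum(p[i] * q[j] for p, q in zip(x, y)) for j in range(3)]
         for i in range(3)]
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def bc_score(query, target):
    """Normalized correlation of signed triplet volumes (assumed BC-like score)."""
    if len(query) != len(target):
        raise ValueError("fragments must have the same number of residues")
    x, y = _centered(query), _centered(target)
    # Planar fragments give a zero denominator; they are not handled here.
    return _det_xty(x, y) / math.sqrt(_det_xty(x, x) * _det_xty(y, y))

# Toy helix-like fragment (not real protein coordinates).
helix = [(math.cos(t), math.sin(t), 0.5 * t) for t in range(8)]
rotated = [(-y, x, z) for x, y, z in helix]    # 90 degree rotation about z
mirrored = [(-x, y, z) for x, y, z in helix]   # reflection through the yz plane
print(bc_score(helix, rotated), bc_score(helix, mirrored))
```

A rotated copy scores close to 1 and a mirrored copy close to -1, without any superimposition step, which matches the behavior described above.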
BCSearch presently allows for four types of search.
Click on the buttons to access the BCSearch or BCLoop services @ the RPBS Mobyle Portal.
When using these services, please cite the following reference:
BCSearch: fast structural fragment mining over large collections of protein structures.
Nucleic Acids Res. 2015 Jul 1;43(W1):W378-82.
Starting from a query, mine large collections of structures to identify similar fragments. Get both fragments and sequences in return.
BCFrag/MirrorSearch and BCLoopSearch make use of 2 parameters to drive the search:
Their tuning differs slightly since BCLoopSearch only considers the flanks, not a complete fragment, which supports less conformational variability. For BCLoopSearch, an additional constraint is that the fragment should not clash sterically with the complete protein structure. In order not to discard hits for which further refinement could resolve clashes, this check only enforces a minimal inter alpha carbon distance of 3 Angstrom.
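The steric filter described above can be pictured with a short sketch. It assumes that the candidate loop and the rest of the protein are given as lists of alpha carbon coordinates, and that the protein list excludes the loop's own flanking residues. The function name and the default threshold argument are illustrative, not part of BCLoopSearch.

```python
import math

def has_clash(loop_ca, protein_ca, min_dist=3.0):
    """Return True if any loop alpha carbon lies closer than min_dist
    (in Angstrom) to any alpha carbon of the rest of the protein."""
    return any(math.dist(a, b) < min_dist
               for a in loop_ca
               for b in protein_ca)
```

A candidate would be kept only when `has_clash` returns False. The permissive 3 Angstrom threshold leaves room for later refinement to resolve milder contacts.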
Presently, the search is based on alpha carbon coordinates. Fragments that are too small (fewer than 6 amino acids) can lead to unreliable results. There is in theory no upper limit, but the probability of identifying long similar ungapped fragments is low.
Presently, the search does not consider the small adjustments that indels could introduce, as observed during the evolution of some functional motifs. This point remains the subject of further investigation.
By setting this option to "Yes", BCSearch will be run using predefined data, a fragment of PDB entry 2EMJA for BCFragSearch and PDB entry 1AKE from which two regions have been removed for BCLoopSearch. Both searches are performed with a minimal BCscore of 0.95 and a maximal rigidity of 1 Angstrom.
Input query file must be in PDB format. The PDB input field can be filled in 3 different ways:
Currently, PDBs with multiple chains are not supported. If the PDB contains multiple chains, the chain ID must be specified. If not specified, the first chain (in alphabetical order) will be selected automatically.
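The chain selection rule can be sketched as follows, assuming standard fixed-column PDB ATOM records (atom name in columns 13-16, chain ID in column 22, coordinates in columns 31-54). The function names are hypothetical and alternate locations are not handled. This only illustrates the documented behavior, not the server code.

```python
def ca_by_chain(pdb_lines):
    """Group alpha carbon coordinates by chain ID from PDB ATOM records."""
    chains = {}
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            chains.setdefault(line[21], []).append(xyz)
    return chains

def select_chain(pdb_lines, chain_id=None):
    """Return (chain_id, CA coordinates); default to the first chain
    in alphabetical order when no chain ID is given."""
    chains = ca_by_chain(pdb_lines)
    if not chains:
        raise ValueError("no alpha carbon found in the input")
    if chain_id is None:
        chain_id = min(chains)
    if chain_id not in chains:
        raise ValueError(
            f"chain {chain_id!r} not found; available: {sorted(chains)}")
    return chain_id, chains[chain_id]
```

Reporting the available chain IDs in the error message mirrors what the service does when the requested chain is absent.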
For BCFrag/MirrorSearch, this sequence corresponds to the sequence of the fragment of the input protein that will be searched, i.e. only the fraction of the PDB input corresponding to that sequence will be used for the similarity search.
If the fragment sequence is ambiguous, the software takes the first occurrence of the sequence in the input. Setting the Skip N first residues field allows you to mask the first N residues: e.g. 15 means that the software will skip the first 15 residues and take the first occurrence of the fragment sequence in the rest.
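The effect of the Skip N first residues field can be sketched with a plain string search. The function name and the returned 0-based, end-exclusive positions are assumptions made for the illustration.

```python
def locate_fragment(chain_sequence, fragment_sequence, skip=0):
    """Return (start, end) of the first occurrence of the fragment that
    begins at or after position `skip` (0-based, end exclusive)."""
    start = chain_sequence.find(fragment_sequence, skip)
    if start < 0:
        raise ValueError("fragment sequence not found in the input chain")
    return start, start + len(fragment_sequence)
```

For a chain in which the fragment occurs twice, `skip=0` selects the first copy, while a value past the start of that copy selects the second one.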
Accepted formats are Fasta and bare sequence (which will be automatically converted to Fasta). A Fasta format begins with a single-line description, followed by lines of sequence data. The description line (defline) is distinguished from the sequence data by a greater-than (">") symbol at the beginning. An example sequence in FASTA format is:
>sequence
CAECGKSFSISSQLATHQRIHTG
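The conversion of a bare sequence to Fasta can be sketched as below. The default description line, the upper-casing and the line width are assumptions. The server may choose differently.

```python
def to_fasta(text, name="sequence", width=60):
    """Return `text` as Fasta: keep it if it already starts with '>',
    otherwise wrap the bare sequence under a description line."""
    text = text.strip()
    if text.startswith(">"):
        return text + "\n"
    sequence = "".join(text.split()).upper()
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"
```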
The collections of structures to mine correspond either to subsets of the PDB defined in the culled PDB [2] or to the SCOPe structural domain compendium [3]. Different lists at 100, 90, 70, 50 and 30% sequence identity are predefined.
Relevant predefined BCscore values are proposed. For BCFragSearch and BCMirrorSearch, values below 0.85 are left to make possible large deviations from the query, but must be used with care.
It corresponds to the maximal deviation, between the query and a hit, of the distance from the center of mass to an alpha carbon. Note that it has exactly the same meaning for BCFragSearch and BCMirrorSearch, distances being identical for a conformation and its mirror conformation.
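One reading of this definition is sketched below: for each residue, compare its distance to the center of mass in the query and in the hit, and keep the largest absolute difference. This is our interpretation of the sentence above, with hypothetical function names, and may differ in detail from the server's computation.

```python
import math

def centroid_distances(coords):
    """Distance of each alpha carbon to the center of mass of the fragment."""
    n = len(coords)
    center = tuple(sum(p[i] for p in coords) / n for i in range(3))
    return [math.dist(p, center) for p in coords]

def rigidity(query, hit):
    """Largest per-residue difference in centroid distance (Angstrom)."""
    if len(query) != len(hit):
        raise ValueError("fragments must have the same number of residues")
    return max(abs(a - b) for a, b in
               zip(centroid_distances(query), centroid_distances(hit)))
```

Because a reflection preserves all distances, a fragment and its mirror image have a rigidity of zero under this definition, which is why the parameter means the same thing for both searches.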
This parameter is for BCSpecificSearch only. It corresponds to the size of the sliding window used to analyze fragments specificity in the input structure. Only three values (9,11,13) are possible, larger sizes being associated with fewer occurrences.
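The sliding window can be pictured as follows. The function name is ours and only the window enumeration is shown, not the specificity analysis itself.

```python
def kmer_windows(sequence, k=9):
    """Return (start, k-mer) pairs for every window of length k."""
    if k not in (9, 11, 13):
        raise ValueError("k must be 9, 11 or 13")
    return [(i, sequence[i:i + k]) for i in range(len(sequence) - k + 1)]
```

A sequence of length L yields L - k + 1 windows, so larger values of k produce fewer (and more specific) fragments to search for.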
This section describes the basic steps to fill a Mobyle form for the BCFragSearch, BCLoopSearch & BCMirrorSearch services and how to launch a job with a short example.
First of all, make sure that the Demonstration mode is set to No.
PDB input:
Click on the db tab to download a pdb file from a databank. Select pdb in the list, then type in 2emjA in the text field, then press the select button. This will populate the PDB input field.
You can leave the PDB chain ID field empty since the chain ID was already specified in the previous step.
Reference sequence:
Click on the paste tab, and copy/paste the following sequence in fasta format in the field below.
>2EMJ_zn_fng_scop90_all_bc0925_r1
CAECGKSFSISSQLATHQRIHTG
Leave the Skip N first residues field (not present in BCLoopSearch) at 0; it is not needed in this case.
In the Search list, select SCOPe.
Type in g.37.1 in the SCOPe class text area.
Set the % identity to 100% by selecting the correct value in the list.
Parameters:
Set Minimal BC score to 0.9.
Finally, set the Maximal rigidity to 2 Å.
Launch the job:
To launch your job, scroll up to the top of the form and click the Run button and wait till it is finished.
Scroll down to the Results section for a description of job outputs.
This section describes the basic steps to fill a Mobyle form for the BCSpecificitySearch service and how to launch a job with a short example.
First of all, make sure that the Demonstration mode is set to No.
PDB input:
Click on the db tab to download a pdb file from a databank. Select pdb in the list, then type in 1qqu in the text field, then press the select button. This will populate the PDB input field.
You can leave the PDB chain ID field empty since the PDB contains only one chain (chain A will be selected automatically).
Search space:
Type in b in the SCOP(e) field.
Type in b.47 in the Exclude SCOP(e) field.
Set the k-mer length to 9 (default value) by selecting the correct value in the list.
Set the % identity to 90%.
Set the Filter results to No filter (default value).
Parameters:
Set Minimal BC score to 0.95.
Finally, set the Maximal rigidity to 1 Å.
Launch the job:
To launch your job, scroll up to the top of the form and click the Run button and wait till it is finished.
Scroll down to the Results section for a description of job outputs.
BCSearch incrementally returns information about job progression and errors, if any, although typical run times are only on the order of a few seconds to a few minutes. Errors related to invalid input data are now also reported in this section.
The results are presented in an interactive tabulated format that can also be downloaded as a csv file. By default, the matches are returned sorted by decreasing values of the kernel score. This table can be sorted interactively by clicking the column headers.
For each match, the Query and match PDB entry (Hit) are reported together with their limits (QueryFrom, QueryTo, HitsFrom, HitsTo), their BCscore and Rigidity, the associated P-value, the RMSd (as a supplementary information) between the query and the match, and the Sequence of the match.
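Once downloaded, the csv file can be post-processed offline. The sketch below assumes that the header row uses the column names listed above (in particular `BCscore` and `Rigidity`) and a comma delimiter. Check the actual file and adjust the names if they differ. The function name is hypothetical.

```python
import csv

def load_hits(handle, min_bc=0.0, max_rigidity=float("inf")):
    """Read hits from an open csv file (or any iterable of lines), keep
    those passing both thresholds, and sort by decreasing BCscore."""
    rows = [row for row in csv.DictReader(handle)
            if float(row["BCscore"]) >= min_bc
            and float(row["Rigidity"]) <= max_rigidity]
    rows.sort(key=lambda row: float(row["BCscore"]), reverse=True)
    return rows
```

Typical use would be `with open("results.csv", newline="") as fh: hits = load_hits(fh, min_bc=0.95)`, where `results.csv` stands for whatever name the downloaded file was saved under.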
A logo representation of the sequences of the matches provides a schematic view of sequence variability among the matches.
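The information behind such a logo is the per-position residue frequency among the matches. Since matches are ungapped, they all have the query length, and a minimal sketch (with a hypothetical function name) is:

```python
from collections import Counter

def column_frequencies(sequences):
    """Per-position residue frequencies for equal-length match sequences."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("expected a non-empty set of equal-length sequences")
    total = len(sequences)
    return [{residue: count / total
             for residue, count in Counter(column).items()}
            for column in zip(*sequences)]
```

The logo drawn by the server conveys the same variability graphically. This sketch only computes raw frequencies, not the letter heights of a logo.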
The PDB entry 2EMJ contains a zinc finger motif in the fragment of sequence CAECGKSFSISSQLATHQRIHTG. Searching the SCOPe superfamily g.37.1 (beta-beta-alpha_zinc_fingers) at 100% sequence identity, using BCscore and rigidity values of 0.9 and 2 Å, respectively, BCFragSearch identifies 20 matches. Among them, 18 have the CXXC*HXXXH motif specific to zinc fingers. The two remaining hits correspond to zinc finger motifs in which indels occur, and which cannot be detected with an ungapped approach, a present limit of BCSearch. Figures 1.A and 1.B show the display as proposed online and the resulting logo representation of the sequences. Extending the search to the SCOPe k class (Designed_proteins), 9 additional matches are identified, among which the 1psv entry, which has the shape of zinc fingers but not the ability to bind zinc. Candidate positions of 1psv to substitute with cysteines and histidines to obtain zinc binding can be read directly from the alignment.
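Counting the matches that carry the motif can be done with a regular expression over the returned sequences. The pattern below is our transcription of CXXC*HXXXH, reading X as any residue and * as a spacer of any length. The function name is hypothetical.

```python
import re

# C, two residues, C, any spacer, H, three residues, H
ZINC_FINGER = re.compile(r"C..C.*H...H")

def has_zinc_finger_motif(sequence):
    """Return True if the sequence contains the CXXC*HXXXH pattern."""
    return ZINC_FINGER.search(sequence) is not None
```

Applied to the query fragment CAECGKSFSISSQLATHQRIHTG, the pattern matches on CAEC followed later by HQRIH.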
The search for geometrical mirror conformations can cover different aspects. In [1], we showed how simple it is to locate left-handed helices starting from alpha helices. Here we illustrate the concept for beta hairpins. The PDB entry 1E0N contains a beta hairpin associated with the fragment of sequence YNAEQKTK. Are there fragments in the PDB that would encode a mirror image of this hairpin (i.e. a fragment with the hairpin bent in the opposite direction)? Searching the culledPDB subset (resolution better than 1.8 Å, R-value less than 0.25) at 30% sequence identity, using BCscore and rigidity values of 0.95 and 1 Å, respectively, BCMirrorSearch identifies 6 fragments, associated with the most probable sequence FDNGDDEG (Fig 2A), that do correspond to a mirror conformation, as shown in Fig 2B.
The loop at position 62-82 of the RAN protein (PDB:1BYU) undergoes a conformational change (3.9 Å) upon binding to form a RAN-RCC1 complex (PDB:1I2M) (Figure 3.A).
Starting from the unbound conformation, and removing residues 66-78, a search using BCLoopSearch with the SCOP 100% sequence identity identifies 291 conformations (Fig 3.B), among which that of the bound conformation, but also numerous conformations covering a range of RMS deviation from 2.1 Å to 6.4 Å.
Not considering RCC1 in complex, BCLoopSearch identifies a conformation that deviates from the bound one by 2.1 Å (Fig 3.C), from the protein Rab28 GTPase in the active form (PDB:3E5H).
The PDB entry 1BEC (the beta chain of a T cell antigen receptor) is a member of the beta class of SCOPe, and a member of the b.1 fold (Immunoglobulin-like beta-sandwich). Specific sites can be searched for in the SCOPe b class, excluding the b.1 fold, with the default parameters (k-mer size of 9, 90% sequence identity, BC score and rigidity cut-off values of 0.95 and 1 Å, respectively). All the sites are located in the same region of the structure, defining a patch (left). The 1BEC biological unit is actually dimeric (right), and most of the specific residues are located at the interface.
Means that the field PDB input has been left empty or filled with wrong data. This field can be filled in 3 different ways:
Means that the chain id specified in the field PDB chain ID was not found in the PDB file. The user is thus provided with the list of chain ids present in the file. If no chain id is specified, the first chain (in alphabetical order) will be selected.
Means that the provided PDB for the BCLoopSearch service does not have any missing residues, therefore no gaps to fill.
Means that the sequence specified in the Reference sequence field is not in a valid format. Accepted formats are Fasta and bare sequence. A Fasta format begins with a single-line description, followed by lines of sequence data. The description line (defline) is distinguished from the sequence data by a greater-than (">") symbol at the beginning. An example sequence in FASTA format is:
>sequence
CAECGKSFSISSQLATHQRIHTG
Means that the seed sequence is not present in the query PDB file, therefore BCMirrorSearch and BCFragSearch cannot search for fragments matching this part of the query PDB.
There have been very few reports of problems with GLSearch; most were related to Java and browser-dependent behavior. We anticipate that the newer BCSearch, which uses WebGL-based JavaScript instead of Java, will only solve part of these problems. BCSearch has been tested successfully using various OS / browser combinations, including:
[1] Guyon F, Tufféry P.
Fast protein fragment similarity scoring using a Binet-Cauchy kernel.
Bioinformatics. 2014 Mar 15;30(6):784-91. doi: 10.1093/bioinformatics/btt618. Epub 2013 Oct 27.
[2] G. Wang and R. L. Dunbrack, Jr.
PISCES: a protein sequence culling server.
Bioinformatics, 19:1589-1591 (2003).
[3] Fox NK, Brenner SE, Chandonia JM.
SCOPe: Structural Classification of Proteins-extended, integrating SCOP and ASTRAL data and classification of new structures.
Nucleic Acids Research 42:D304-309. doi: 10.1093/nar/gkt1240. (2014).