BCSearch

Services to mine collections of protein structures to identify fragments similar to a query.

Overview

The growing number of experimentally solved 3D structures calls for tools to explore their diversity. Whereas numerous tools have been developed for the structural alignment of complete protein structures or complete domains, tools that explore conformations at a more local level (that of small fragments) face two limits: the huge number of comparisons to perform, and the relevance of the similarity criteria, since most similarity scores are meaningful only for large enough fragments (see [1]). This hampers efforts to better understand the local sequence-structure relationship, which is of interest in various contexts such as protein design (mining collections of structures for sequences compatible with a local shape), protein modeling (identifying alternative conformations that could match boundary conditions), or, more generally, assessing how frequently a particular local conformation occurs in the available structures.

BCSearch is a fast and flexible approach to identify linear fragments similar to a query in large collections of structures. It addresses two basic questions: (i) among a subset of structures, which sequences are compatible with the conformation of my query, and (ii) are there conformations similar to my query in other proteins, i.e. is it observed elsewhere? BCSearch relies on a similarity measure based on a Binet-Cauchy (BC) kernel [1]. The approach measures the correlation between the volumes of all the tetrahedra of the query and those of a target. The similarity (BCscore) ranges from -1 to 1, where a value of 1 corresponds to exactly the same conformation as the query, and -1 to its mirror conformation. Values close to 0 correspond to unrelated fragments. The BCscore is more stringent than other criteria such as the alpha carbon RMS deviation. In particular, fragments with partly dissimilar shapes are poorly scored, so collections of matches are usually less noisy, which makes them better suited for the analysis of the local structure-sequence relationship. In addition, since no superimposition is required, the similarity search is very fast, making it possible to mine large collections of structures.
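To make the scoring idea above concrete, here is a minimal Python sketch of a BC-style score for two fragments given as lists of alpha carbon coordinates. It assumes the determinant form of the Binet-Cauchy kernel: by the Cauchy-Binet identity, the determinant of the 3x3 cross matrix of centered coordinates equals the sum, over all residue triplets, of the products of the signed volumes they span with the centroid. The function names (bc_score and its helpers) are hypothetical and this is not the BCSearch implementation.

```python
from math import sqrt


def _centered(coords):
    """Translate a list of (x, y, z) points so that their centroid is the origin."""
    n = len(coords)
    center = [sum(p[i] for p in coords) / n for i in range(3)]
    return [[p[i] - center[i] for i in range(3)] for p in coords]


def _cross_matrix(a, b):
    """Return the 3x3 matrix A^T B for two equally long lists of 3D points."""
    return [[sum(p[i] * q[j] for p, q in zip(a, b)) for j in range(3)]
            for i in range(3)]


def _det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def bc_score(query, target):
    """BC-style similarity in [-1, 1] between two fragments of equal length.

    Assumed form: det(X^T Y) / sqrt(det(X^T X) * det(Y^T Y)) on centered
    coordinates. Planar fragments give a zero denominator and are not handled.
    """
    if len(query) != len(target) or len(query) < 4:
        raise ValueError("fragments must have the same length (at least 4 points)")
    x, y = _centered(query), _centered(target)
    denominator = sqrt(_det3(_cross_matrix(x, x)) * _det3(_cross_matrix(y, y)))
    return _det3(_cross_matrix(x, y)) / denominator
```

Under this assumption, no superimposition is needed: rotating or translating either fragment leaves the score unchanged, and reflecting one of them flips its sign.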

BCSearch presently allows for four types of search.

  • The first one (BCFragSearch) corresponds to the ungapped search for fragments similar to a query, and thus provides means to explore, from ensembles of structures, the amino acid sequence variability compatible with a given conformation.
  • The second one (BCLoopSearch) corresponds to the search for fragments matching boundary conditions. With only the flanks of a query fragment specified, BCSearch returns fragments of the available structures that match the geometry of the boundaries, thus providing ensembles of conformations that can be used, for instance, in the context of loop redesign.
  • Since 2015 January 19th, a third one (BCMirrorSearch) corresponds to the ungapped search for fragments similar to the mirror of a query, and thus provides means to explore, from ensembles of structures, the amino acid sequence variability considering chiral conformations.
  • Since 2015 February 28th, a fourth one (BCSpecificitySearch) addresses the question of the specificity of the fragment conformations in complete protein structures, identifying sites associated with few occurrences of similar fragments. It systematically searches for fragments similar to each fragment of size k of the query protein, and returns a score measuring how specific each fragment is. In practice, the score is: sp = 1 - Nhits/Ntotal, where Nhits is the number of proteins in which a similar fragment was found, and Ntotal is the total number of proteins in the search bank. Thus, sp varies between 0 and 1, where 0 means the fragment was found in every protein of the bank, and 1 means no similar fragment was found in the bank.
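The specificity score of BCSpecificitySearch is simple enough to sketch directly. The following Python snippet is only an illustration of the formula sp = 1 - Nhits/Ntotal and of the enumeration of all fragments of size k; the function names are hypothetical, and the similarity search that produces Nhits is not shown.

```python
def sliding_fragments(sequence, k):
    """Return every contiguous fragment of length k, in order of position."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def specificity(n_hits, n_total):
    """sp = 1 - Nhits/Ntotal.

    n_hits  : number of proteins of the bank in which a similar fragment was found
    n_total : total number of proteins in the bank
    """
    if n_total <= 0:
        raise ValueError("the bank must contain at least one protein")
    if not 0 <= n_hits <= n_total:
        raise ValueError("n_hits must lie between 0 and n_total")
    return 1.0 - n_hits / n_total
```

For example, a fragment found in 25 proteins of a bank of 100 would get sp = 0.75, while a fragment found nowhere would get sp = 1.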

Click on the buttons to access the BCSearch or BCLoop services @ the RPBS Mobyle Portal.

Minimal fragment size is 6 residues.

When using this service, please cite the following reference:

Guyon F, Martz F, Vavrusa M, Bécot J, Rey J, Tufféry P.
BCSearch: fast structural fragment mining over large collections of protein structures.
Nucleic Acids Res. 2015 Jul 1;43(W1):W378-82.

Features

  • Ungapped fragment similarity search

    Starting from a query, mine large collections of structures to identify similar fragments. Get both fragments and sequences in return.

  • Search constraints

    BCFrag/MirrorSearch and BCLoopSearch make use of 2 parameters to drive the search:

    1. The BC score, between 1 (perfect match) and -1 (perfect mirror).
    2. The fragment deformation: since the kernel score is flexible, a maximum deformation makes it easy to restrict the matches to the vicinity of the query.

    Their tuning differs slightly, since BCLoopSearch only considers flanks, not a complete fragment, which supports less conformational variability. For BCLoopSearch, an additional constraint is that the fragment should not have steric clashes with the complete protein structure. In order not to discard hits for which further refinement could resolve clashes, this is only enforced using a minimal inter alpha carbon distance of 3 Angstrom.
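The clash constraint used by BCLoopSearch can be illustrated with a short Python sketch. It assumes that a candidate loop and the rest of the protein are both given as lists of alpha carbon coordinates, and it only applies the 3 Angstrom minimal distance quoted above; the function name and the brute-force pairwise loop are illustrative choices, not the actual implementation.

```python
from math import dist

MIN_CA_DISTANCE = 3.0  # Angstrom, the minimal inter alpha carbon distance quoted above


def has_steric_clash(loop_ca, protein_ca, min_distance=MIN_CA_DISTANCE):
    """Return True if any loop alpha carbon lies closer than min_distance
    to an alpha carbon of the rest of the protein.

    protein_ca is expected to exclude the residues of the loop itself.
    """
    return any(dist(a, b) < min_distance for a in loop_ca for b in protein_ca)
```

A candidate for which has_steric_clash returns True would be discarded; keeping the threshold this permissive leaves room for hits whose minor clashes could be resolved by later refinement.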

Limitations

Fragment size:

Presently, the search is based on alpha carbon coordinates. Fragments that are too small (fewer than 6 amino acids) can lead to unreliable results. There is in theory no upper limit, but the probability of identifying long similar ungapped fragments is low.

Ungapped search:

Presently, the search does not account for the small adjustments that indels would require, such as those observed during the evolution of some functional motifs. This point remains a subject for further investigation.

Usage

Demonstration mode

  • Pre-configured test:

    By setting this option to "Yes", BCSearch will be run using predefined data, a fragment of PDB entry 2EMJA for BCFragSearch and PDB entry 1AKE from which two regions have been removed for BCLoopSearch. Both searches are performed with a minimal BCscore of 0.95 and a maximal rigidity of 1 Angstrom.

Input

  • Input structure

    Input query file must be in PDB format. The PDB input field can be filled in 3 different ways:

    • Click on the paste tab and then copy/paste your PDB file in the text field, or
    • Click on the db tab, then select a databank in the select menu, then type in a valid PDB id or a valid SCOPe id, depending on the databank selected. For PDB ids (of the form 1tim), it is possible to select a single chain by appending the label of the chain to the PDB id (e.g. 1timB), then click select, or
    • Click on the upload tab, then click choose a file, then select your file and click open to finish.

    Currently, searching with several chains at once is not supported. If the PDB file contains multiple chains, the chain ID must be specified. If it is not specified, the first chain (in alphabetical order) will be selected automatically.

  • Reference sequence

    • For BCFrag/MirrorSearch, this sequence corresponds to the sequence of the fragment of the input protein that will be searched, i.e. only the fraction of the PDB input corresponding to that sequence will be used for the similarity search.

      If the fragment sequence is ambiguous, the software takes the first occurrence of the sequence in the input. Setting the Skip N first residues field allows you to mask the first N residues: for example, 15 means that the software will skip the first 15 residues and take the first occurrence of the fragment sequence in the rest.

    • For BCLoopSearch, this sequence corresponds to the sequence of the complete protein, i.e. including the sequence of the missing "loops". The sequence should match the N-terminus and C-terminus of the PDB structure, otherwise an error will be returned.

    Accepted formats are Fasta and bare sequence (which will be automatically converted to Fasta). A Fasta format begins with a single-line description, followed by lines of sequence data. The description line (defline) is distinguished from the sequence data by a greater-than (">") symbol at the beginning. An example sequence in FASTA format is:

    >sequence
    CAECGKSFSISSQLATHQRIHTG
  • Bank to mine

    Collections of structures to mine correspond either to subsets of the PDB defined in the culled PDB [2] or to the SCOPe structural domain compendium [3]. Different lists at 100, 90, 70, 50 and 30% sequence identity are predefined.

    • For the culled PDB, the collections are extracted from structures determined by X-ray diffraction. Lists are available that correspond to a resolution of 1.6, 1.8, 2.0, 2.2 or 2.5 Angstrom or better, and an R-value less than 0.25 for all but the 2.2 and 2.5 lists, for which an R-value cutoff of 1.0 is used. Lists at a lower resolution are presently not proposed since, given the accuracy of the BCscore, it makes little sense to mine structures with large uncertainty on their coordinates. Note however that such filtering is not performed with the SCOPe lists.
    • For SCOPe, there is no distinction on the origin and quality of the structures, but it is additionally possible to restrict the search to a class (c), fold (f), superfamily (s) or family (f) using the c.f.s.f scheme (e.g. a.1.1.1); see the SCOPe documentation for more explanations. A list of the accepted selections is available at: http://bioserv.rpbs.univ-paris-diderot.fr/services/BCSearch/SCOP-hierarchy.txt. Specifying "all" corresponds to all the SCOPe entries at the specified sequence identity (i.e. over 190 000 domains at 100% sequence identity).
    • For BCSpecificitySearch, it is additionally possible to specify a subset to exclude from the bank to mine, using the same conventions. The aim of this exclusion is to remove homologs of the structure analyzed, and thus to exclude occurrences of similar conformations in these homologs for better contrast. For instance, to analyze a protein such as 1QQU, a member of the b.47 SCOPe fold, it is possible to search in the b class, excluding the b.47 fold, to search for specific sites.
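The way a c.f.s.f selection and an exclusion combine can be sketched as follows. This is a hedged illustration only: how BCSearch resolves selections internally is not documented here, and the function names are hypothetical. The sketch assumes that each domain carries a full four-level identifier (e.g. b.47.1.2) and that a selection matches when its dot-separated levels are a prefix of that identifier.

```python
def matches_selection(domain_id, selection):
    """True if a SCOPe identifier such as 'b.47.1.2' falls under a selection
    such as 'b', 'b.47' or 'all'. Levels are compared one by one, so 'b.4'
    does not match 'b.47.1.2'.
    """
    if selection == "all":
        return True
    wanted = selection.split(".")
    return domain_id.split(".")[:len(wanted)] == wanted


def in_search_space(domain_id, include, exclude=None):
    """True if the domain is selected by `include` and not removed by `exclude`."""
    if not matches_selection(domain_id, include):
        return False
    return exclude is None or not matches_selection(domain_id, exclude)
```

With the 1QQU example above, in_search_space(domain_id, "b", "b.47") keeps every domain of the b class except those of the b.47 fold.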

Search parameters

  • Minimal BCscore

    Relevant predefined BCscore values are proposed. For BCFragSearch and BCMirrorSearch, values below 0.85 remain available to allow large deviations from the query, but must be used with care.

  • Maximal rigidity

    It corresponds to the maximal deviation of the distance of the center of mass to one alpha carbon, between the query and a hit. Note that it has exactly the same meaning for BCFragSearch and BCMirrorSearch, distances being identical for a conformation and its mirror conformation.

  • k-mer length

    This parameter is for BCSpecificitySearch only. It corresponds to the size of the sliding window used to analyze fragment specificity in the input structure. Only three values (9, 11, 13) are possible, larger sizes being associated with fewer occurrences.
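The maximal rigidity parameter described above can be illustrated with a small Python sketch. It follows one reading of the definition: for each residue, compare the distance from the alpha carbon to the center of mass in the query and in the hit, and keep the largest absolute difference. The function names are hypothetical, and the center of mass is taken as the plain centroid of the alpha carbons, which is an assumption.

```python
from math import dist


def centroid_distances(coords):
    """Distance of each alpha carbon to the centroid of the fragment."""
    n = len(coords)
    center = tuple(sum(p[i] for p in coords) / n for i in range(3))
    return [dist(p, center) for p in coords]


def rigidity(query, hit):
    """Largest per-residue deviation (in Angstrom) between the
    centroid-to-alpha-carbon distances of the query and of the hit.
    """
    if len(query) != len(hit):
        raise ValueError("query and hit must have the same number of residues")
    return max(abs(a - b) for a, b in
               zip(centroid_distances(query), centroid_distances(hit)))
```

Because only distances are involved, a conformation and its mirror image give a rigidity of 0, which is consistent with the parameter having the same meaning for BCFragSearch and BCMirrorSearch.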

Step by step tutorial

  • BCFragSearch, BCLoopSearch & BCMirrorSearch

    This section describes the basic steps to fill a Mobyle form for the BCFragSearch, BCLoopSearch & BCMirrorSearch services and how to launch a job with a short example.

    • First of all, make sure that the Demonstration mode is set to No.

    • PDB input:

      • Click on the db tab to download a pdb file from a databank. Select pdb in the list, then type in 2emjA in the text field, then press the select button. This will populate the PDB input field.

      • You can leave the PDB chain ID field empty since the chain ID was already specified in the previous step.

    • Reference sequence:

      • Click on the paste tab, and copy/paste the following sequence in fasta format in the field below.

        >2EMJ_zn_fng_scop90_all_bc0925_r1
        CAECGKSFSISSQLATHQRIHTG
      • Leave the Skip N first residues field (not present in BCLoopSearch) at 0; it is not needed in this case.

      • In the Search list, select SCOPe.

      • Type in g.37.1 in the SCOPe class text area.

      • Set the % identity to 100% by selecting the correct value in the list.

    • Parameters:

      • Set Minimal BC score to 0.9.

      • Finally, set the Maximal rigidity to 2 Å.

    • Launch the job:

      • To launch your job, scroll up to the top of the form and click the Run button and wait till it is finished.

    Scroll down to the Results section for a description of job outputs.

  • BCSpecificitySearch

    This section describes the basic steps to fill a Mobyle form for the BCSpecificitySearch service and how to launch a job with a short example.

    • First of all, make sure that the Demonstration mode is set to No.

    • PDB input:

      • Click on the db tab to download a pdb file from a databank. Select pdb in the list, then type in 1qqu in the text field, then press the select button. This will populate the PDB input field.

      • You can leave the PDB chain ID field empty since the PDB contains only one chain (chain A will be selected automatically).

    • Search space:

      • Type in b in the SCOP(e) field.

      • Type in b.47 in the Exclude SCOP(e) field.

      • Set the k-mer length to 9 (default value) by selecting the correct value in the list.

      • Set the % identity to 90%.

      • Set the Filter results to No filter (default value).

    • Parameters:

      • Set Minimal BC score to 0.95.

      • Finally, set the Maximal rigidity to 1 Å.

    • Launch the job:

      • To launch your job, scroll up to the top of the form and click the Run button and wait till it is finished.

    Scroll down to the Results section for a description of job outputs.

Results

  • Progress report

    BCSearch incrementally returns information about job progression and errors, if any, although typical run times are only on the order of a few seconds to a few minutes. Errors related to invalid input data are now also reported in this section.

  • Visualization of best matches

    BCFragSearch, BCLoopSearch & BCMirrorSearch best matches visualization

    Up to the 10 best matches can be visualized superimposed onto the query using the PV javascript protein viewer. Different coloring operations are available: Uniform colors the structure in a uniform color; Succession colors the structure's secondary structure elements with a gradient; Occupancy and Temperature factor color the structure according to the occupancy and temperature factor columns respectively (if present in the PDB file). Temperature threshold can be adjusted with the sliding bar.

    BCSpecificitySearch best matches visualization

    Structure is colored according to the specificity score calculated for each residue: most specific sites are displayed in red, and least specific in blue. Specificity threshold can be adjusted with the sliding bar.

  • Tabulated presentation of the matches

    The results are presented in an interactive tabulated format that can also be downloaded as a csv file. By default, the matches are returned sorted by decreasing values of the kernel score. This table can be sorted interactively by clicking the column headers.

    For each match, the Query and match PDB entry (Hit) are reported together with their limits (QueryFrom, QueryTo, HitsFrom, HitsTo), their BCscore and Rigidity, the associated P-value, the RMSd between the query and the match (as supplementary information), and the Sequence of the match.

  • Logo representation of the match sequences

    A logo representation of the sequences of the matches provides a schematic view of sequence variability among the matches.
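What the logo summarizes can be approximated from the Sequence column of the matches. The Python sketch below computes per-position amino acid frequencies and the sequence made of the most frequent residue at each position. It is an illustration only: the function names are hypothetical, and a true sequence logo additionally scales each column by its information content, which is not done here.

```python
from collections import Counter


def position_frequencies(sequences):
    """Per-position amino acid frequencies for ungapped matches of equal length."""
    if not sequences:
        raise ValueError("at least one sequence is required")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all match sequences must have the same length")
    total = len(sequences)
    return [{aa: count / total for aa, count in Counter(column).items()}
            for column in zip(*sequences)]


def most_probable_sequence(sequences):
    """Sequence made of the most frequent amino acid at each position.

    Ties are resolved in favour of the residue seen first in the input order.
    """
    return "".join(max(freqs, key=freqs.get)
                   for freqs in position_frequencies(sequences))
```

Since the search is ungapped, all matches have the length of the query, so columns can be compared directly without any alignment step.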

Examples

  • BCFragSearch

    The PDB entry 2EMJ contains a zinc finger motif in the fragment of sequence: CAECGKSFSISSQLATHQRIHTG. Searching the SCOPe superfamily g.37.1 (beta-beta-alpha_zinc_fingers) at 100% sequence identity, using BCscore and rigidity values of 0.9 and 2, respectively, BCFragSearch identifies 20 matches. Among them, 18 have the CXXC*HXXXH motif specific to zinc fingers. The two remaining hits correspond to zinc finger motifs in which indels occur, and that cannot be detected with an ungapped approach, a present limit of BCSearch. Figures 1.A and 1.B show the display as proposed online and the resulting logo representation of the sequence. Extending the search to the SCOPe k class (Designed_proteins), 9 additional matches are identified, among which the 1psv entry, which has the shape of zinc fingers but not the ability to bind zinc. For 1psv, candidate positions to substitute with cysteines and histidines to obtain zinc binding can be read directly from the alignment.

    Fig 1.A

    All found fragments superposed on the query structure can be visualized thanks to the PV javascript protein viewer. Fragments are ranked by BCscore and can be visualized individually or all together.

    Fig 1.B

    A logo representation of the sequences of all the matches provides a schematic view of sequence variability among the matches. The BCFragSearch example was run on a zinc finger structure; the C2H2 pattern is evident from the 20 hits, as seen in this logo representation.

  • BCMirrorSearch

    The search for geometrical mirror conformations can encompass different aspects. In [1], we have shown how simple it is to locate left-handed helices starting from alpha helices. Here we illustrate the concept for beta hairpins. The PDB entry 1E0N contains a beta hairpin associated with the fragment of sequence: YNAEQKTK. Are there fragments in the PDB that would encode a mirror image of this hairpin (i.e. fragments with the hairpin bent in the opposite direction)? Searching the culled PDB subset (resolution better than 1.8, R-value less than 0.25) at 30% sequence identity, using BCscore and rigidity values of 0.95 and 1, respectively, BCMirrorSearch identifies 6 fragments, associated with the most probable sequence FDNGDDEG (Fig 2.A), that do correspond to a mirror conformation, as shown in Fig 2.B.

    Fig 2.A

    A logo representation of the sequences of the 6 matches provides a schematic view of sequence variability among the matches. The sequence made of the most probable amino acids, FDNGDDEG, differs largely from that of the query, YNAEQKTK.

    Fig 2.B

    All 6 mirror fragments are shown superimposed on the query structure, visualized with the PV javascript protein viewer.

  • BCLoopSearch

    The loop at position 62-82 of the RAN protein (PDB:1BYU) undergoes a conformational change (3.9 A) upon binding to form a RAN-RCC1 complex (PDB:1I2M) (Figure 3.A).

    Starting from the unbound conformation, and removing residues 66-78, a search using BCLoopSearch with the SCOP 100% sequence identity identifies 291 conformations (Fig 3.B) among which that of the bound conformation, but also numerous conformations covering a range of RMS deviation from 2.1 A to 6.4 A.

    Not considering RCC1 in complex, BCLoopSearch identifies a conformation that deviates from the bound one by 2.1 A (Fig 3.C), from the protein Rab28 GTPase in the active form (PDB:3E5H).

    Fig 3.A

    RAN protein in unbound (PDB:1BYU - green) and bound conformations (PDB:1I2M - blue)

    Fig 3.B

    Matching conformations as identified by BCLoopSearch (missing part corresponds to residues 66-78)

    Fig 3.C

    Bound closest conformation from protein Rab28 GTPase (PDB:3E5H)

  • BCSpecificitySearch

    The PDB entry 1BEC (the beta chain of a T cell antigen receptor) is a member of the beta class of SCOPe, and a member of the b.1 fold (Immunoglobulin-like beta-sandwich). The search for specific sites can be carried out in the SCOPe b class, excluding the b.1 fold, with the default parameters (k-mer size of 9, 90% sequence identity, BC score and rigidity cut-off values of 0.95 and 1, respectively). All the sites are located in the same region of the structure, defining a patch (left). In fact, the 1BEC biological unit is dimeric (right), and most of the specific residues are located at the interface.

    Fig. 1bec.png

Common errors and solutions

  • No pdb_file specified

    Means that the field PDB input has been left empty or filled with wrong data. This field can be filled in 3 different ways:

    • Click on the paste tab and then copy/paste your PDB file in the text field, or
    • Click on the db tab, then select a database in the select menu, then type in the pdb id, then click select, or
    • Click on the upload tab, then click choose a file, then select your file and click open to finish.

  • Chain ID='X' not present in the PDB file, select one of the following: 'A', 'B', 'C'

    Means that the chain id specified in the field PDB chain ID was not found in the PDB file. The user is thus provided with the list of chain ids present in the file. If no chain id is specified, the first chain (in alphabetical order) will be selected.

  • Cannot detect missing residues

    Means that the provided PDB for the BCLoopSearch service does not have any missing residues, therefore no gaps to fill.

  • Input seed sequence is not a valid Fasta sequence.

    Means that the sequence specified in the Reference sequence field is not in a valid format. Accepted formats are Fasta and bare sequence. A Fasta format begins with a single-line description, followed by lines of sequence data. The description line (defline) is distinguished from the sequence data by a greater-than (">") symbol at the beginning. An example sequence in FASTA format is:

    >sequence
    CAECGKSFSISSQLATHQRIHTG
  • Seed sequence not found in PDB

    Means that the seed sequence is not present in the query PDB file, therefore BCMirrorSearch and BCFragSearch cannot search for fragments matching this subset of the query PDB.
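The two sequence-related errors above can be reproduced with a short Python sketch of how a reference sequence might be checked and located in the query chain. It is a hedged illustration, not the service code: the function names are hypothetical, the validity check (letters only) is an assumption, and the Skip N first residues field is modelled as the start offset of a plain substring search.

```python
def normalise_fasta(text, default_name="sequence"):
    """Accept Fasta or a bare sequence and return (defline, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if lines and lines[0].startswith(">"):
        header, body = lines[0], lines[1:]
    else:
        # Bare sequence: add a default description line.
        header, body = ">" + default_name, lines
    sequence = "".join(body).upper()
    if not sequence.isalpha():
        raise ValueError("Input seed sequence is not a valid Fasta sequence.")
    return header, sequence


def locate_seed(chain_sequence, seed, skip=0):
    """Return (start, end) of the first occurrence of seed at or after
    position `skip`, as 0-based indices with `end` excluded.
    """
    start = chain_sequence.find(seed, skip)
    if start < 0:
        raise ValueError("Seed sequence not found in PDB")
    return start, start + len(seed)
```

If the same seed occurs twice in the chain, calling locate_seed with skip=0 returns the first copy, while a skip value beyond the start of the first copy returns the second one.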

History

  • 2010, September 10th - Initial development of the kernel score.
  • 2011, November 17th - First draft of a global approach to mine large collections of structures.
  • 2012, September 1st - First implementation of the GLSearch service (restricted access). Internal tests.
  • 2013, January 3rd - First release of the service.
  • 2014, March - Publication of the Binet Cauchy kernel approach of structural similarity.
  • 2014, October 30th - GLSearch, too complex, is superseded by BCSearch. Two sub services corresponding to fragment and loop search are proposed.
  • 2015, January 19th - BCMirror search, that takes advantage of the possibility to identify mirror conformations using the BCscore is implemented.
  • 2015, February 28th - BCSpecificity search, which evaluates how specific fragment conformations are within a complete protein structure, is implemented.

Known problems and answers

There has been very little feedback reporting problems with GLSearch; most reports were related to Java and browser-dependent behavior. We anticipate that the newer BCSearch, which makes use of WebGL-based javascript instead of Java, will only solve part of it. BCSearch has been tested successfully using various OS / browser combinations, including:

  • Firefox and Chrome under Linux,
  • Internet Explorer, Firefox, Chrome under Windows 7 (no webGL working),
  • Safari, Firefox and Chrome under MacOS X (lion, snow leopard).

References

[1] Guyon F, Tufféry P.
Fast protein fragment similarity scoring using a Binet-Cauchy kernel.
Bioinformatics. 2014 Mar 15;30(6):784-91. doi: 10.1093/bioinformatics/btt618. Epub 2013 Oct 27.
[2] G. Wang and R. L. Dunbrack, Jr.
PISCES: a protein sequence culling server.
Bioinformatics, 19:1589-1591 (2003).
[3] Fox NK, Brenner SE, Chandonia JM.
SCOPe: Structural Classification of Proteins-extended, integrating SCOP and ASTRAL data and classification of new structures.
Nucleic Acids Research 42:D304-309. doi: 10.1093/nar/gkt1240. (2014).