This paper is an extended version of our paper published in Algorithms for Computational Biology: Wohlers, I.; Le Boudic-Jamin, M.; Djidjev, H.; Klau, G. W.; Andonov, R. Exact Protein Structure Classification Using the Maximum Contact Map Overlap Metric. In Proceedings of the First International Conference, AlCoB 2014, Tarragona, Spain, 1–3 July 2014; pp. 262–273.

This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).

In this work, we propose a new distance measure for comparing two protein structures based on their contact map representations. We show that our novel measure, which we refer to as the maximum contact map overlap (max-CMO) metric, satisfies all properties of a metric on the space of protein representations. Having a metric in that space allows one to avoid pairwise comparisons on the entire database and, thus, to significantly accelerate exploring the protein space compared to non-metric spaces. We show on a gold standard superfamily classification benchmark set of 6759 proteins that our exact

Understanding the functional role and evolutionary relationships of proteins is key to answering many important biological and biomedical questions. Because the function of a protein is determined by its structure and because structural properties are usually conserved throughout evolution, such problems can be better approached if proteins are compared based on their representations as three-dimensional structures rather than as sequences. Databases, such as SCOP (Structural Classification of Proteins) [

Both SCOP and CATH, however, are constructed partly based on manual curation, and many of the currently over 90,000 protein structures in the Protein Data Bank (PDB) [

One approach to solving that problem relies on a meaningful distance measure between any two protein structures. Then, the family of a query protein

Several such distances have been proposed, each having its own advantages. A number of approaches based on a graph-based measure of closeness called contact map overlap (CMO) [

In this work, we propose a new distance measure for comparing two protein structures based on their contact map representations. We show that our novel measure, which we refer to as the maximum contact map overlap (max-CMO) metric, satisfies all properties of a metric. The advantages of nearest neighbor searching in metric spaces are well described in the literature [

Amongst the other existing (non-CMO) protein structure comparison methods, we are aware of only one exploiting the triangle inequality. This is the so-called scaled Gauss metric (SGM) introduced in [

We focus here on the notions of contact map overlap (CMO) and the related max-CMO distance between protein structures. A contact map describes the structure of a protein
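To make the notion of overlap concrete, the following minimal sketch counts the contacts shared under a given alignment and, for tiny inputs only, maximizes that count by exhaustive search over order-preserving alignments. All function names are hypothetical, the brute-force search is purely illustrative and is not the exact solver used in this work, and the normalization by the larger number of contacts is an assumption about the max-CMO distance, not a formula quoted from the text.

```python
from itertools import combinations


def overlap(edges1, edges2, alignment):
    """Count contacts of map 1 that the alignment maps onto contacts of map 2.

    edges1, edges2: lists of (u, v) residue index pairs with u != v.
    alignment: dict mapping residues of map 1 to residues of map 2 (injective).
    """
    contacts2 = {frozenset(e) for e in edges2}
    shared = 0
    for u, v in edges1:
        if u in alignment and v in alignment:
            if frozenset((alignment[u], alignment[v])) in contacts2:
                shared += 1
    return shared


def max_overlap_bruteforce(n1, edges1, n2, edges2):
    """Maximize the overlap over all order-preserving partial alignments.

    Exponential in n1 and n2; intended only for toy examples.
    """
    best = 0
    for size in range(min(n1, n2) + 1):
        for rows in combinations(range(n1), size):
            for cols in combinations(range(n2), size):
                # Pairing sorted rows with sorted cols keeps the residue order.
                best = max(best, overlap(edges1, edges2, dict(zip(rows, cols))))
    return best


def max_cmo_distance(n1, edges1, n2, edges2):
    """ASSUMED normalization: 1 - overlap / max(|E1|, |E2|)."""
    denom = max(len(edges1), len(edges2))
    if denom == 0:
        return 0.0
    return 1.0 - max_overlap_bruteforce(n1, edges1, n2, edges2) / denom
```

For example, a four-residue path with contacts (0,1), (1,2), (2,3) and a three-residue triangle with contacts (0,1), (1,2), (0,2) share at most two contacts, so under the assumed normalization their distance is 1 - 2/3.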

The alignment visualized with dashed lines

Computing

Note that

Hence:

Furthermore,

Hence,

Four contact map graphs.

Let

The following claim states that

The mapping

Similarly, we compute the mapping corresponding to

On the other hand,

Combining the last inequality with Equation (

If instead of

We suggest approaching the problem of classifying a given query protein structure with respect to a database of target structures based on a majority vote of the
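A majority vote over the labels of the nearest targets can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the function name is hypothetical, and it assumes that a tie for the most frequent label leaves the query unassigned.

```python
from collections import Counter


def majority_vote(neighbor_labels):
    """Return the most frequent label, or None when the top count is tied.

    neighbor_labels: superfamily labels of the nearest targets of a query.
    """
    if not neighbor_labels:
        return None
    ranked = Counter(neighbor_labels).most_common()
    # A tie between the two most frequent labels gives no assignment.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]
```

With a single neighbor the vote can never tie, whereas two neighbors with different labels always do.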

An important feature of our approach is that it is based on a metric, and we fully profit from all usual benefits when exploiting the structure introduced by that metric. In addition, we also model each protein family in the database as a ball with a specially-chosen protein from the family as the center, see

In order to minimize the number of targets with which a query has to be compared directly,

In order to find

In order to find the target structures that are closest to a query

In order to conclude that

Given a query

We define the triangle upper (respectively lower) bound as:
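As a hedged illustration, bounds of this kind can be derived from the triangle inequality through an intermediate structure such as a family representative. The sketch below assumes that each distance is known only as an interval [lower, upper]; the names q (query), r (representative) and t (target) and the function itself are hypothetical, and the form shown is the standard triangle-inequality form, not necessarily the exact definition used here.

```python
def triangle_bounds(lb_qr, ub_qr, lb_rt, ub_rt):
    """Bound d(q, t) from interval bounds on d(q, r) and d(r, t).

    Uses d(q, t) <= d(q, r) + d(r, t) and
         d(q, t) >= |d(q, r) - d(r, t)|,
    evaluated conservatively on the intervals. The result is clipped to
    [0, 1], assuming the distance is bounded between zero and one.
    """
    upper = min(1.0, ub_qr + ub_rt)
    lower = max(lb_qr - ub_rt, lb_rt - ub_qr, 0.0)
    return lower, upper
```

When both input distances are known exactly, the lower bound reduces to the absolute difference of the two distances and the upper bound to their sum.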

Using Lemma 5, we derive supplementary sufficient conditions for dominance, which we call indirect dominance.
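The role of dominance in pruning can be sketched as follows. The sketch assumes that a target A dominates a target B with respect to a query when the upper bound on the distance to A is strictly smaller than the lower bound on the distance to B, and that B can be discarded once at least k targets dominate it. Both the function name and this reading of dominance are assumptions made for illustration; the indirect variant is not modeled.

```python
def prune_dominated(bounds, k):
    """Return the targets that may still be among the k nearest neighbors.

    bounds: dict mapping target -> (lower, upper) bounds on its distance
            to the query, with lower <= upper.
    A target is dropped when at least k targets have an upper bound
    strictly below its lower bound.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(bounds) <= k:
        return set(bounds)
    uppers = sorted(upper for _, upper in bounds.values())
    kth_upper = uppers[k - 1]  # k-th smallest upper bound
    # A target never dominates itself because lower <= upper, so comparing
    # against the k-th smallest upper bound counts only other targets.
    return {t for t, (lower, _) in bounds.items() if lower <= kth_upper}
```

Tightening any bound can only shrink the surviving set, so repeating the pruning after each refinement step is safe under these assumptions.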

The priority queues LB and UB are sorted in the order of increasing distance. The

We assume that distances between family members are computed optimally (this is actually done in our preprocessing step when computing the family representatives),

We evaluated the classification performance and efficiency of different types of dominance of our algorithm on domains from SCOPCath [

For every protein class, the table lists the number of structures in SCOPCath (str) and extended SCOPCath (ext), the corresponding number of families (fam) and superfamilies (sup).

Class | a | b | c | d | e | f | g | h | i | j | k |
---|---|---|---|---|---|---|---|---|---|---|---|
# str | 1195 | 1593 | 1774 | 1591 | 30 | 103 | 342 | 72 | 11 | 38 | 10 |
# ext | 10,796 | 19,215 | 17,497 | 15,679 | 349 | 1006 | 2398 | 520 | 43 | 81 | 25 |
# fam | 524 | 516 | 548 | 632 | 6 | 59 | 121 | 32 | 5 | 29 | 8 |
# sup | 303 | 266 | 191 | 375 | 6 | 52 | 82 | 31 | 5 | 29 | 8 |

For classification, we randomly selected one query from every family with at least six members. This resulted in 236 queries for SCOPCath and 1369 queries for the extended SCOPCath benchmark.

We then computed all-

For every query, the

We compare our exact

In order to investigate the impact of

LB

UB

FAM

Recompute

Update priority of

Update priority of

//

LB ← LB

UB ← UB

FAM

Apply the dominance protocol for query

In the first preprocessing step, we evaluate how well our distance metric captures known similarities and differences between protein structures by computing intra-family and inter-family distances. A good distance for structure comparison should pool similar structures,

Histograms of intra-family distances divided by class: (

We then compute a radius around the representative structure that encompasses all structures of the corresponding family. The number of families with a given radius decreases nearly linearly from zero to 0.6, with most families having a radius close to zero and almost no families having a radius greater than 0.6. The histogram of family radii is visualized in
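The family radius and the bound it provides can be sketched as follows. The helper names are hypothetical, the distances are assumed to be available as a symmetric nested dictionary, and choosing the representative that minimizes the radius is only one plausible selection rule, not necessarily the one used in this work.

```python
def family_radius(dist, representative, members):
    """Largest distance from the representative to any other family member."""
    return max(
        (dist[representative][m] for m in members if m != representative),
        default=0.0,
    )


def choose_representative(dist, members):
    """ASSUMED rule: pick the member whose ball around the family is smallest."""
    return min(members, key=lambda m: family_radius(dist, m, members))


def ball_lower_bound(d_query_rep, radius):
    """Lower bound on the distance from a query to ANY member of the family.

    Follows from the triangle inequality: d(q, t) >= d(q, rep) - d(rep, t)
    and d(rep, t) <= radius for every member t.
    """
    return max(0.0, d_query_rep - radius)
```

A single-member family has radius zero, in which case the bound is simply the distance to the representative itself.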

A histogram of the radii of the multi-member families.

Histograms of overlap values between any two multi-member families for the four main classes a–d: (

Considering that the distance metric is bounded between zero and one, the intra-family distances and radii show that the distance captures the similarity between structures well overall. Further, we investigate the distance between protein families by computing their overlap value as defined by

When classifying the 236 queries of SCOPCath, we achieve between 89% and 95% correct superfamily assignments; see

Classification results showing the number of queries, out of 236 overall, that have been assigned to a superfamily, the number of correct assignments, the number of assignments computed exactly, the number of those that are correct and the number of ties that do not allow a superfamily assignment based on majority vote. The last two lines display the number of correct assignments and ties for

k | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
---|---|---|---|---|---|---|---|---|---|---|
# correct | 210 | 211 | 213 | 213 | 214 | 217 | 217 | 219 | 213 | 224 |
# exact | 117 | 143 | 156 | 165 | 188 | 206 | 204 | 211 | 209 | 234 |
# exact and correct | 110 | 134 | 149 | 155 | 178 | 198 | 195 | 205 | 206 | 224 |
# ties | 10 | 9 | 11 | 8 | 10 | 10 | 10 | 10 | 20 | 0 |
# TM-align correct | 219 | 220 | 220 | 225 | 225 | 228 | 226 | 227 | 226 | 228 |
# TM-align ties | 4 | 4 | 9 | 5 | 5 | 3 | 8 | 5 | 8 | 0 |
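The accuracy range quoted for the 236 SCOPCath queries can be recomputed directly from the counts in the table. The short script below is only an arithmetic check; the variable and function names are hypothetical.

```python
QUERIES = 236

# Number of correct superfamily assignments per k, copied from the table.
correct_max_cmo = {10: 210, 9: 211, 8: 213, 7: 213, 6: 214,
                   5: 217, 4: 217, 3: 219, 2: 213, 1: 224}
correct_tm_align = {10: 219, 9: 220, 8: 220, 7: 225, 6: 225,
                    5: 228, 4: 226, 3: 227, 2: 226, 1: 228}


def percent_range(correct_by_k, total):
    """Return (lowest, highest) percentage of correct assignments over all k."""
    rates = [100.0 * count / total for count in correct_by_k.values()]
    return min(rates), max(rates)


low, high = percent_range(correct_max_cmo, QUERIES)
print(f"max-CMO:  {low:.1f}% to {high:.1f}%")   # max-CMO:  89.0% to 94.9%
low_tm, high_tm = percent_range(correct_tm_align, QUERIES)
print(f"TM-align: {low_tm:.1f}% to {high_tm:.1f}%")
```

The lowest max-CMO rate corresponds to k = 10 (210 of 236) and the highest to k = 1 (224 of 236).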

Boxplots of the percentage of removed targets at each iteration during triangle and pairwise dominance for the 236 queries of the SCOPCath benchmark.

Our exact

Classification results showing the number of queries, out of 1369 overall, that have been assigned to a superfamily, the number of correct assignments, the number of assignments computed exactly, the number of those that are correct and the number of ties that do not allow a superfamily assignment based on majority vote. The last two lines display the number of correct assignments and ties for

k | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
---|---|---|---|---|---|---|---|---|---|---|
# correct | 1303 | 1331 | 1334 | 1341 | 1341 | 1346 | 1344 | 1351 | 1348 | 1361 |
# exact | 1120 | 1182 | 1228 | 1271 | 1286 | 1339 | 1341 | 1352 | 1347 | 1368 |
# exact and correct | 1104 | 1166 | 1215 | 1257 | 1276 | 1329 | 1330 | 1341 | 1343 | 1360 |
# ties | 35 | 5 | 12 | 6 | 11 | 7 | 9 | 3 | 17 | 0 |
# TM-align correct | 1311 | 1347 | 1346 | 1350 | 1351 | 1354 | 1352 | 1353 | 1351 | 1361 |
# TM-align ties | 39 | 4 | 7 | 4 | 6 | 4 | 4 | 5 | 15 | 0 |

Boxplots of the percentage of removed targets at each iteration during triangle and pairwise dominance for the 1369 queries of the extended SCOPCath benchmark.

The difficulty of optimally computing a superfamily assignment using

Our exact classification is based on a well-known property of exact CMO computation: similar structures are quick to align, and their alignments are usually computed exactly, whereas dissimilar structures are extremely slow to align, and their alignments are usually not computed exactly. Therefore, we remove dissimilar structures early using bounds. Distances between similar structures can then be computed (near-)optimally, and the resulting

Except for the case

While for the extended benchmark, max-CMO and TM-align have the same number of correct classifications for the best choice of value for

Moreover, although the current results suggest that, in terms of assignment accuracy, using only the nearest neighbor for classification works best, finding the

We show that our approach is beneficial for handling large datasets, the structures of which form clusters in some metric space, because it can quickly discard dissimilar structures using metric properties, such as the triangle inequality. This way, the target dataset does not need to be reduced beforehand using a different distance measure, such as sequence similarity, which can lead to mistakes. Our classification is at all times based exclusively on structural distance.

Among the disadvantages of a heuristic approach for the task of large-scale structure classification, we can point to the observation that the obtained classifications are not stable. As versions of tools or random seeds change, the distance between structures may change, since the provable distance between two structures is not known. With these distance changes, the entire classification may also change. Such possible, unpredictable changes in classification contradict the essential use of an automatic classification as a reference. Furthermore, even if a given heuristic could be very fast, it always requires a pairwise number of comparisons for solving the classification problem by the

In this work, we introduced a new distance based on the CMO measure and proved that it is a true metric, which we call the max-CMO metric. We analyzed the potential of max-CMO for solving the

In summary, our approach provides a general solution to

We are grateful to Noël Malod-Dognin and Nicola Yanev for discussions and useful suggestions and to Sven Rahmann for providing computational infrastructure. We thank the reviewers for a careful reading and for the comments on this study.

Rumen Andonov, Gunnar W. Klau, Mathilde Le Boudic-Jamin and Inken Wohlers conceived the k-NN classification approach using triangle bounds and the max-CMO metric. Rumen Andonov, Hristo Djidjev and Gunnar W. Klau proved that max-CMO is a metric and that other previously used measures are not. Inken Wohlers implemented the classification algorithm. Mathilde Le Boudic-Jamin and Inken Wohlers conducted the computational experiments and prepared the results. The authors jointly examined the results and wrote the paper.

The authors declare no conflict of interest.