Optimized Storage and Querying for Matched Molecular Pairs: The MMP Database / by Raphaël Berthier

Matched Molecular Pairs Analysis (MMPA) is a widely used method for analysing relationships between molecular structure, properties and activity. It makes it possible to extract frequently observed transformations and to associate property or activity changes with these structural modifications. One of the main advantages of MMPA is that it is intuitive and its results are easy to interpret, which makes it one of the most important tools in the medicinal chemist’s toolbox.

In this blog post we present our latest developments for handling large MMP datasets and databases more efficiently, keeping them up to date, and building responsive applications on top of them.

What are Matched Molecular Pairs?

A Matched Molecular Pair, or MMP, can be defined as a pair of molecules that share a common structural part and differ only through a well-defined structural modification. Thus, an MMP defines a transformation between two molecules. The figure on the right depicts an MMP, with the common structural part (called the Core, coloured in grey) and the two variable parts (called Fragments, coloured in red) that define the transformation in the MMP. Molecules within a given MMP can therefore be converted into one another by chemically replacing one specific fragment with another.

How are MMPs used?

MMPA relies on finding all MMPs within a given dataset, computing the corresponding change in property or activity between the two molecules of each pair, and associating this difference with a chemical transformation. A given transformation, if observed a sufficient number of times, can give an overview of what could happen if it is applied to other molecules: the robustness of predictions that can be made with MMPA is therefore directly related to the amount of data available. Thus, MMPA becomes increasingly useful as dataset sizes increase.
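The workflow described above can be sketched in a few lines of Python. This is only an illustrative toy, not the Pipeline Pilot or Discngine implementation: it assumes the molecules have already been fragmented into (core, fragment) records, and the labels, property values and the function names `find_pairs` and `summarize` are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean

# Hypothetical, pre-fragmented toy data: (core label, fragment label, property value).
records = [
    ("coreA", "F", 5.0), ("coreA", "Cl", 5.6),
    ("coreB", "F", 6.1), ("coreB", "Cl", 6.5),
    ("coreC", "F", 4.0), ("coreC", "Cl", 4.8), ("coreC", "Br", 5.0),
]

def find_pairs(records):
    """Pair up molecules sharing a core; return ((frag_a, frag_b), delta) tuples."""
    by_core = defaultdict(list)
    for core, fragment, value in records:
        by_core[core].append((fragment, value))
    pairs = []
    for members in by_core.values():
        # Sorting gives every transformation a single, canonical direction.
        for (frag_a, val_a), (frag_b, val_b) in combinations(sorted(members), 2):
            if frag_a != frag_b:
                pairs.append(((frag_a, frag_b), val_b - val_a))
    return pairs

def summarize(pairs, min_count=3):
    """Keep transformations seen at least min_count times: {transformation: (n, mean delta)}."""
    deltas = defaultdict(list)
    for transformation, delta in pairs:
        deltas[transformation].append(delta)
    return {t: (len(d), mean(d)) for t, d in deltas.items() if len(d) >= min_count}

stats = summarize(find_pairs(records), min_count=3)
print(stats)
```

With this toy data only the Cl to F transformation is observed three times, so it is the only one that passes the `min_count` filter; this mirrors the point that predictions become more robust as more pairs support a transformation.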

Since we’ve been working with MMPs for quite a while now, you can find a more complete overview of what MMPs are and how they are used by computational and medicinal chemists in a previous Chemistry Collection release article, which presents the Matched Molecular Pair Analysis Components for Target Activity Prediction on Small Datasets. The “Context” of an MMP will not be detailed here, but it is thoroughly explained in the linked article.

MMP issues in pharmaceutical R&D

When MMPs are generated on large corporate databases, large amounts of data are inherently produced. This data must be properly organized for drug discovery scientists to be able to use it efficiently. From what we’ve heard and seen, the methods currently used to store MMPs have one or more of the following limitations: a high response time for a given query on MMPs (e.g. what is the general effect of this transformation on this chemical property?), a long update time when new molecules are added (which usually requires recomputing all MMPs at each update), or the inability to store several representations of the context for a given MMP. As a result, MMPA-based applications perform poorly: they contain only part of the molecules they should contain, and they are not updated quickly or regularly. Finally, MMPs are often calculated “on-the-fly” on smaller datasets: a more convenient solution, but one that is far from time-efficient and that does not yield robust predictions with high statistical significance.

The Matched Molecular Pairs database

For the past few months, we have been developing and testing a new Matched Molecular Pair Database model and the Pipeline Pilot Toolbox that goes along with it. The Oracle database model delivered with the Discngine Chemistry Collection enables MMP storage for millions of molecules and frequent, fast updates, while retaining time-efficient MMP-related queries. Additionally, several specific chemical context types can be added for each MMP, independently and at any time. This database can then easily be integrated into existing applications (e.g. Pipeline Pilot, TIBCO Spotfire, Web Service integration, Database integration…).

Retrieving all MMPs for a given molecule, a fragment, or a transformation with or without context specification is now only a matter of seconds, even for datasets of millions of molecules. Updating your MMPs with new structures is now done in a few seconds. It goes without saying that this includes both multi-cut MMPs and MMPs with hydrogen fragments.

Moreover, you now have much finer control over the identification of MMPs (i.e. which MMPs are to be formed and which are not, the molecular bonds to ignore, the size of fragments and cores…), and you can use specific MMP identification rules (e.g. RECAP-MMPs). Specific core chemical contexts for each MMP can also be generated independently: in addition to the classic core chemical context representation provided by Pipeline Pilot, this Discngine Chemistry Collection release also includes the ability to generate and store multiple context representations (e.g. Pharmacophore Graph context representation), including your own definition of context, for each MMP.
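To illustrate why updates do not have to recompute every MMP, here is a minimal in-memory sketch of a core-keyed index. It is an assumption made for illustration only and does not describe the actual Oracle database model: the `CoreIndex` class, its methods and the molecule identifiers are hypothetical, and the fragmentation of each molecule into (core, fragment) entries is assumed to be done elsewhere.

```python
from collections import defaultdict

class CoreIndex:
    """Hypothetical index: core -> {molecule id: fragment}."""

    def __init__(self):
        self._by_core = defaultdict(dict)

    def add(self, mol_id, fragmentations):
        """Register a new molecule and return only the MMPs it creates.

        fragmentations: iterable of (core, fragment) for the new molecule.
        Only molecules already stored under the same core are visited,
        so existing pairs are never recomputed.
        """
        new_pairs = []
        for core, fragment in fragmentations:
            bucket = self._by_core[core]
            for other_id, other_fragment in bucket.items():
                if other_id != mol_id and other_fragment != fragment:
                    new_pairs.append((other_id, mol_id, core, other_fragment, fragment))
            bucket[mol_id] = fragment
        return new_pairs

    def molecules_for_core(self, core):
        """Return the ids of all molecules stored under the given core."""
        return sorted(self._by_core.get(core, {}))

index = CoreIndex()
index.add("m1", [("coreA", "F")])
print(index.add("m2", [("coreA", "Cl")]))
```

In a relational database the same idea would typically rely on an index on the core column, so that both lookups and incremental inserts touch only the rows sharing a core with the molecule of interest.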

A few benchmark values

In the benchmarks we’ve run on an MMP database of 1 million molecules (with more than 90 million MMPs generated), the average response time for querying all MMPs for a single molecule was 0.5 seconds, the average response time for querying the database for a given transformation was 0.03 seconds, and updating the MMP Database with 2,000 new molecules never took more than 1 minute.

Key Benefits

This Discngine Chemistry Collection release, with its new optimized storage and querying for MMPs, lets you boost the performance of your MMPA-based tools with finer control over MMP identification, gain statistical significance in your MMPA property predictions, and perform MMPA on both your in-house structures and your vendors’ catalogue structures. Moreover, it lets you generate new context representations of your own and perform MMPA on medicinal chemists’ ideas that have not yet been registered.

The Matched Molecular Pair database can also be used to generate Matched Molecular Series (MMSs, i.e. series of analogue compounds that have a common substructure and different fragments, and are thus all MMPs of one another), and to identify common scaffolds in a given series of analogue compounds. With MMPs and MMSs, navigating through your SAR and making property/activity predictions with a wide range of MMPA methodologies becomes much easier and statistically more robust.
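As a rough sketch of how series can be derived from stored pairs, the snippet below groups the fragments of all pairs by their shared core. The input layout, the labels and the `matched_series` function are assumptions made for this example, not the toolbox’s API.

```python
from collections import defaultdict

def matched_series(pairs, min_size=3):
    """Group pairs by shared core and return {core: sorted fragments}.

    pairs: iterable of (core, fragment_a, fragment_b) tuples (hypothetical layout).
    A core is reported as a series once at least min_size distinct fragments
    are attached to it; all of those compounds are MMPs of one another.
    """
    fragments_by_core = defaultdict(set)
    for core, fragment_a, fragment_b in pairs:
        fragments_by_core[core].update((fragment_a, fragment_b))
    return {
        core: sorted(fragments)
        for core, fragments in fragments_by_core.items()
        if len(fragments) >= min_size
    }

# Hypothetical pairs: coreC carries three fragments, coreA only two.
example_pairs = [
    ("coreC", "F", "Cl"),
    ("coreC", "F", "Br"),
    ("coreC", "Cl", "Br"),
    ("coreA", "F", "Cl"),
]
series = matched_series(example_pairs)
print(series)
```

Here the shared core plays the role of the common scaffold of the series, while the list of fragments gives the substituents to compare.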

Integration examples

Here are a few integration examples that we’ve built with our MMP database, Pipeline Pilot, Oracle and TIBCO Spotfire.