Does AI Need to Learn Protein-Ligand Interactions to Calculate? (II)


When the similarity of ligand scaffolds or protein sequences between the training set and the test set is further restricted, the performance of the ACNN model drops significantly. This indicates that the model may “predict” protein-ligand interactions through simple ligand similarity or protein similarity: similar ligands have similar binding activities, and similar proteins have similar binding activities, so the model never has to learn complicated protein-ligand binding patterns. Between two points, AI will always take the shortest straight line. This reflects the strong fitting ability of neural networks, which are good at discovering correlations. However, such a model predicts accurately only in scenarios that closely resemble the training set, and it generalizes poorly. Building a robust AI model requires massive and diverse data, and this requirement is especially hard to meet in a field where crystal structures and activity measurements of protein-ligand complexes are scarce and expensive. Therefore, when these models face real-world drug discovery and optimization scenarios that differ greatly from the training set, their predictions are unlikely to be reliable.
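The similarity-restricted split described above can be sketched as a grouped split, in which every scaffold (or protein cluster) lands entirely in the training set or entirely in the test set. This is a minimal illustration, not the procedure used for the ACNN experiments: the function name `grouped_split`, the `"scaffold"` key and the toy records are assumptions made for the example, and in practice the group labels would come from a cheminformatics toolkit or a sequence-clustering tool.

```python
from collections import defaultdict


def grouped_split(samples, group_of, test_fraction=0.2):
    """Split samples so that no group appears in both train and test.

    `group_of` maps a sample to a hashable group label, for example a
    ligand scaffold or a protein sequence cluster (hypothetical here).
    """
    groups = defaultdict(list)
    for sample in samples:
        groups[group_of(sample)].append(sample)

    # Fill the test set with the smallest groups first, so that rare
    # scaffolds are unseen at training time. Keys are sorted for determinism.
    ordered = sorted(groups.items(), key=lambda kv: (len(kv[1]), str(kv[0])))
    target = test_fraction * len(samples)
    train, test = [], []
    for _, members in ordered:
        if len(test) + len(members) <= target:
            test.extend(members)
        else:
            train.extend(members)
    return train, test


# Toy data: ten ligands spread over four made-up scaffolds A, B, C, D.
ligands = [{"id": i, "scaffold": s} for i, s in enumerate("AAAABBBCCD")]
train_set, test_set = grouped_split(ligands, lambda x: x["scaffold"], 0.3)
train_scaffolds = {x["scaffold"] for x in train_set}
test_scaffolds = {x["scaffold"] for x in test_set}
```

A random split of the same ten ligands would almost certainly put scaffold A on both sides, which is exactly the leakage that lets a model score well by ligand similarity alone.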

Since experimental data are insufficient, can computer-generated data be used to train the model? DUD/DUD-E, a commonly used molecular docking benchmark, contains not only 22,886 active small molecules (actives) but also 1.41 million small molecules as negative controls (decoys). A decoy must have physicochemical properties (molecular weight, net charge, etc.) similar to those of an active but a different topology (represented by molecular fingerprints), which reduces false negatives, that is, decoys that actually bind. This design overcomes a shortcoming of earlier benchmark sets, in which molecular docking software could distinguish actives from decoys through simple physicochemical properties alone and thereby obtain a high benchmark score. Traditional molecular docking thus once fell into the same trap (ranking by simple physicochemical properties only), but it climbed out by adopting reasonable negative controls, and it now has a reliable test set for evaluating methods objectively. The problem with AI is: do we know what its pitfalls are? The similarity of the data sets? The size of the data set? To answer this question, a random forest (RF) was trained on the larger DUD-E data set to classify actives and decoys, using either six physicochemical properties (PROP) or molecular fingerprints (FP) as input features.
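The decoy criterion described above (matched properties, dissimilar topology) can be illustrated with a small filter. This is only a sketch under stated assumptions: fingerprints are represented as sets of on-bit indices, the names `tanimoto` and `pick_decoys` are hypothetical, and the thresholds `mw_tol` and `max_sim` are illustrative values, not the ones DUD-E uses.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def pick_decoys(active, candidates, mw_tol=25.0, max_sim=0.35):
    """Keep candidates that match the active's properties but not its topology.

    The tolerance and similarity cutoff are illustrative assumptions.
    """
    decoys = []
    for cand in candidates:
        same_props = (abs(cand["mw"] - active["mw"]) <= mw_tol
                      and cand["charge"] == active["charge"])
        dissimilar = tanimoto(cand["fp"], active["fp"]) <= max_sim
        if same_props and dissimilar:
            decoys.append(cand)
    return decoys


# Toy molecules with made-up molecular weights, charges and fingerprint bits.
active = {"name": "a1", "mw": 350.0, "charge": 0, "fp": {1, 2, 3, 4}}
candidates = [
    {"name": "c1", "mw": 360.0, "charge": 0, "fp": {1, 2, 3, 5}},  # too similar
    {"name": "c2", "mw": 345.0, "charge": 0, "fp": {1, 7, 8, 9}},  # accepted
    {"name": "c3", "mw": 520.0, "charge": 0, "fp": {10, 11}},      # weight mismatch
    {"name": "c4", "mw": 352.0, "charge": 1, "fp": {12, 13}},      # charge mismatch
]
decoy_names = [d["name"] for d in pick_decoys(active, candidates)]
```

Note that the topology filter is what creates the fingerprint bias discussed later: by construction, decoys never look like actives in fingerprint space.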

With the six physicochemical properties as input features and a random three-fold cross-validation (CV), the random forest reaches an average AUC of 0.73 across the 102 DUD-E targets and an average enrichment factor for the top 1% of ranked molecules (EF1) of 22.2, very close to the performance of the molecular docking software reported in the DUD-E article (AUC: 0.76, EF1: 19.8). After small molecules with a molecular weight above 500 are removed (a bias that has been reported) and cross-validation is grouped by protein class, the AUC falls to 0.66 and EF1 falls to 5.14. This indicates that a model trained on DUD-E may learn biases in physicochemical properties, including: 1) the actives contain small molecules with a molecular weight above 500, whereas the decoys are limited to drug-like small molecules with a molecular weight below 500; 2) actives of similar targets have similar physicochemical properties, so the model can distinguish actives from decoys by physicochemical properties alone.
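For readers unfamiliar with the two metrics quoted above, the following sketch computes them from a ranked list. It uses the standard definitions: AUC as the probability that a random active outscores a random decoy, and the enrichment factor as the active rate in the top fraction divided by the active rate overall. The function names and toy scores are assumptions for illustration, and the pairwise AUC loop is quadratic, so it is meant for small examples only.

```python
def roc_auc(scores, labels):
    """AUC: chance that a random active outscores a random decoy (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one decoy")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def enrichment_factor(scores, labels, fraction=0.01):
    """Active rate in the top-scoring fraction divided by the overall active rate."""
    n = len(scores)
    n_top = max(1, int(round(fraction * n)))
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits_top = sum(y for _, y in ranked[:n_top])
    return (hits_top / n_top) / (sum(labels) / n)


# Toy ranking: 20 molecules with strictly decreasing scores; the actives
# sit at ranks 1, 2, 6 and 11 (all values are made up).
toy_scores = [20 - i for i in range(20)]
toy_labels = [1 if i in (0, 1, 5, 10) else 0 for i in range(20)]
toy_auc = roc_auc(toy_scores, toy_labels)
toy_ef10 = enrichment_factor(toy_scores, toy_labels, fraction=0.10)
```

Because the enrichment factor is capped by the inverse of the active rate, an EF1 above 20 on DUD-E means that a sizeable share of the top 1% is active, which is why the drop to 5.14 under the stricter protocol is so telling.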

With molecular fingerprints as input features, the random forest distinguishes actives from decoys well even in settings where physicochemical properties offer little to learn from. Taking the 84 fingerprint features that occur with high frequency and differ in occurrence between actives and decoys, and sorting them by their frequency in the ZINC database, shows that DUD-E is also biased in topology (molecular fingerprints). There are two reasons: 1) DUD-E selects as decoys small molecules from ZINC whose topology is dissimilar to that of the actives, so actives and decoys differ markedly, as expected; 2) the decoy distribution is closer to that of ZINC, which indicates that the topology distribution of the actives differs from that of ZINC. DUD-E is therefore biased in both physicochemical properties and topology. As long as a model can learn these features, explicitly or implicitly, it can hardly avoid being misled by the bias, even if it is trained on docked complexes.
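One way to look for this kind of topological bias is to compare how often each fingerprint bit is set in actives and in decoys. The sketch below is an assumption about how such a comparison could be done, not the analysis from the paper: fingerprints are sets of on-bit indices, and the names `bit_frequencies` and `discriminating_bits` as well as the cutoffs `min_freq` and `min_gap` are hypothetical.

```python
from collections import Counter


def bit_frequencies(fingerprints):
    """Fraction of molecules in which each fingerprint bit is set."""
    counts = Counter(bit for fp in fingerprints for bit in set(fp))
    n = len(fingerprints)
    return {bit: c / n for bit, c in counts.items()}


def discriminating_bits(actives, decoys, min_freq=0.2, min_gap=0.3):
    """Bits that are frequent and occur at very different rates in the two sets.

    Returns (bit, active_frequency, decoy_frequency) tuples, largest gap first.
    """
    fa, fd = bit_frequencies(actives), bit_frequencies(decoys)
    rows = []
    for bit in set(fa) | set(fd):
        a, d = fa.get(bit, 0.0), fd.get(bit, 0.0)
        if max(a, d) >= min_freq and abs(a - d) >= min_gap:
            rows.append((bit, a, d))
    return sorted(rows, key=lambda r: abs(r[1] - r[2]), reverse=True)


# Toy fingerprints: bit 1 marks the actives, bit 4 marks the decoys,
# bits 2 and 3 are equally common in both sets.
toy_actives = [{1, 2}, {1, 3}, {1, 2}, {1}]
toy_decoys = [{2, 4}, {4}, {2, 4}, {3, 4}]
biased = discriminating_bits(toy_actives, toy_decoys)
biased_bits = {bit for bit, _, _ in biased}
```

Running the same frequency count against a reference library such as ZINC would then show which of the two sets deviates from the background distribution.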

In summary, the author believes that sufficient and unbiased data are currently lacking for training AI drug discovery and design models based on protein-ligand complex structures. Because AI models are so powerful at picking up correlations, the following suggestions are proposed in order to assess reasonably how well an AI model predicts protein-ligand binding strength and to promote healthy development of the field:

1) PDBbind remains the most suitable experimental data set so far. However, when training a model on PDBbind, protein-only and ligand-only models should be set up as baseline controls, so that the source of any performance improvement can be properly assessed.

2) The protein similarity and ligand similarity between the training set and the test set should be systematically controlled to properly evaluate the generalization ability of the model.

3) The DUD-E data set should be used as an independent benchmark test set, not as a training set.
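The baseline control from suggestion 1 can be sketched as follows: the same learner is trained three times, on ligand-only, protein-only and full-complex features, and the errors are compared. Everything here is an illustrative assumption: a one-nearest-neighbor predictor stands in for the real model, the feature keys `"ligand"`, `"protein"` and `"complex"` are hypothetical, and the toy affinities are constructed so that they depend on the ligand feature only.

```python
def nearest_neighbor_predict(train_x, train_y, x):
    """Return the label of the closest training vector (squared Euclidean)."""
    best = min(range(len(train_x)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)))
    return train_y[best]


def mean_abs_error(train, test, view):
    """Train and evaluate on a single feature view of each sample."""
    train_x = [s[view] for s in train]
    train_y = [s["affinity"] for s in train]
    errors = [abs(nearest_neighbor_predict(train_x, train_y, s[view]) - s["affinity"])
              for s in test]
    return sum(errors) / len(errors)


def baseline_report(train, test):
    """Error of the ligand-only, protein-only and full-complex models."""
    return {view: mean_abs_error(train, test, view)
            for view in ("ligand", "protein", "complex")}


def make(lig, prot, affinity):
    """Build a toy sample; the complex view simply concatenates both features."""
    return {"ligand": [lig], "protein": [prot],
            "complex": [lig, prot], "affinity": affinity}


# Toy data in which affinity is determined by the ligand feature alone.
toy_train = [make(0.0, 0.0, 5.0), make(1.0, 0.0, 7.0),
             make(0.0, 1.0, 5.0), make(1.0, 1.0, 7.0)]
toy_test = [make(0.1, 0.9, 5.0), make(0.9, 0.1, 7.0)]
report = baseline_report(toy_train, toy_test)
```

In this toy case the complex model is no better than the ligand-only model, so its accuracy cannot be credited to learned protein-ligand interactions. That comparison is the point of the control.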
