Blog Article
From Prediction to Proof: Why AI/ML Drug Discovery Still Needs Experimental Data
Artificial intelligence, machine learning, and large language models (collectively referred to here as “AI/ML”) are genuinely reshaping how scientists imagine, prioritize, and advance new therapies. The promise is everywhere: headlines constantly tout how AI/ML will compress years of drug discovery into months and slash the cost of identifying new therapies. However, there is a fundamental truth that is not stressed enough: predictive models are only as powerful as the experimental data behind them. And right now, the data is the bottleneck.
Bridging the Gap Between Prediction and Biology
A predictive model can identify promising targets, rank candidate compounds, or predict binding interactions with a certain confidence. What the model cannot do is confirm those predictions against the noisy reality of biology. Validation requires experiments: assays designed with the computational question in mind and executed with enough rigor that the results can be fed back into the next model iteration with confidence.
This is where many AI/ML-driven programs quietly stall. The computational side is sophisticated, but the experimental infrastructure supporting it often isn’t built with the same intention. Bridging that gap should be a strategic imperative, not an afterthought tacked onto a computational output.
Data Quality is the Limiting Factor for AI/ML
Computational power is rarely the bottleneck. Data quality almost always is. Part of the reason is the need for effective training sets that include sufficient true positives and true negatives, along with the contextual metadata that can enhance model predictions. Subtle variables, such as reagent lot numbers, instrument settings, environmental conditions (temperature, humidity), compound origin, and the operator running the experiment, can all influence assay performance and therefore any model trained on those datasets. When inconsistencies arise, that level of traceability allows rapid troubleshooting and keeps experimental signals interpretable for model development and optimization. For this reason, AI/ML teams are increasingly treating datasets as strategic assets that appreciate in value with every iteration, rather than as downstream considerations.
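As a minimal sketch of what that traceability might look like in practice, the record below bundles a single measurement with its contextual metadata so batch effects can be traced back to reagents, instruments, or operators. The AssayRecord class and its field names are illustrative assumptions, not a reference to any particular system.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class AssayRecord:
    """One assay measurement plus the contextual metadata that may explain batch effects."""
    compound_id: str
    readout: float                 # e.g. percent inhibition
    assay_name: str
    reagent_lot: str               # reagent lot number
    instrument_settings: dict      # e.g. {"gain": 120, "read_height_mm": 6.5}
    temperature_c: float
    humidity_pct: float
    compound_origin: str           # e.g. "vendor-X" or "internal-synthesis"
    operator: str
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

# Logging a record like this alongside every data point keeps signals traceable.
record = AssayRecord(
    compound_id="CMPD-0001",
    readout=73.2,
    assay_name="kinase-inhibition-v2",
    reagent_lot="LOT-2024-117",
    instrument_settings={"gain": 120, "read_height_mm": 6.5},
    temperature_c=22.4,
    humidity_pct=41.0,
    compound_origin="internal-synthesis",
    operator="analyst-07",
)
print(asdict(record))
```

Capturing this context at the point of measurement is what turns a pile of assay readouts into a computable, model-ready asset.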
Speed and Rigor Are Not a Trade-off
When scientists are asked when they need their data, the most common answer is “Yesterday.” For AI/ML teams refining predictive models, speed matters even more: rapid experimental feedback can determine whether a model iteration takes days, weeks, or months. To support those teams, experimental workflows must be designed for adaptability and rapid iteration, enabling partners to move quickly from computational hypothesis to experimental validation. Rather than relying on rigid, one-size-fits-all screening templates, each project should be tailored to the biological question being asked. That flexibility generates the right data efficiently, without sacrificing the reproducibility and data fidelity that make the results meaningful and valuable for modeling. And while speed matters, the real goal is a balance of speed, flexibility, and meticulous attention to data fidelity that makes partnerships productive and results trusted.
The Future: Closed Loops Between Models and Experiments
The next generation of drug discovery will belong to teams that master both sides of the equation: smart algorithms and smart experiments running in tight, continuous integration. We are moving toward closed-loop discovery, where computational predictions drive experimental design and experimental results immediately sharpen model performance, accelerating the path toward viable drug candidates. The organizations that will lead this shift are building the infrastructure now: experimental platforms designed for speed and adaptability, metadata frameworks that make datasets computable assets, and scientific teams that understand they are not just running assays; they are training the models that will define the next generation of medicine.
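For a concrete, if deliberately simplified, picture of such a loop, here is a toy simulation in the spirit of active learning. Everything in it is an assumption made for illustration: the “compounds” are random feature vectors, the “assay” is a hidden noisy function, and a random forest stands in for whatever predictive model a real program would use.

```python
# Toy closed-loop discovery cycle: model -> candidate selection -> "assay" -> retrain.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def run_assay(x):
    """Stand-in for the wet-lab experiment: a hidden activity function plus measurement noise."""
    true_activity = np.sin(3 * x[:, 0]) + 0.5 * x[:, 1]
    return true_activity + rng.normal(scale=0.1, size=len(x))

# Virtual library of candidate "compounds" (feature vectors).
library = rng.uniform(-1, 1, size=(5000, 5))

# Seed the loop with a small random screen.
labeled_idx = list(rng.choice(len(library), size=32, replace=False))
labels = list(run_assay(library[labeled_idx]))

for round_idx in range(5):
    # 1. Train the model on everything measured so far.
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(library[labeled_idx], labels)

    # 2. Predictions drive experimental design: test the untested compounds the model ranks highest.
    untested = np.setdiff1d(np.arange(len(library)), labeled_idx)
    preds = model.predict(library[untested])
    batch = untested[np.argsort(preds)[-16:]]

    # 3. Experimental results immediately feed the next model iteration.
    labeled_idx.extend(batch.tolist())
    labels.extend(run_assay(library[batch]).tolist())

    print(f"round {round_idx}: best measured activity = {max(labels):.3f}")
```

In a real program the candidate pool, assay, and model would of course be far richer, but the shape of the loop, predictions choosing experiments and experiments retraining predictions, is the same.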
The therapies of the future will not emerge from code alone. They will come from the disciplined, iterative integration of computation and biology, where every prediction is tested, every result informs the next hypothesis, and the distance between insight and impact continues to shrink.