Computational Biotech

How AI Accelerates Drug Discovery: From 10 Years to 10 Months

Don Calaki · 10 min read

Bringing a single drug to market costs an average of $2.6 billion, takes 10 to 15 years, and fails more than 90% of the time. These aren't edge cases — they're industry averages. The pharmaceutical pipeline is one of the most expensive, slow, and failure-prone systems in modern science. AI is rewriting every part of it.

This isn't theoretical. Companies are already moving from target identification to clinical candidate in under 18 months. The economics of drug discovery are being fundamentally restructured, and the organisations that adopt AI-native pipelines will dominate the next decade of pharmaceutical innovation.

What Does the Traditional Drug Discovery Pipeline Look Like?

The traditional drug discovery process follows a rigid, sequential pipeline that has remained largely unchanged for decades. Understanding each stage is critical to seeing where AI delivers the most transformative impact.

Stage 1: Target Identification (2-3 years). Researchers identify a biological target — typically a protein, enzyme, or receptor — that plays a causal role in a disease. This involves years of basic research, literature review, genetic association studies, and biological validation. Most targets never lead to druggable candidates.

Stage 2: Hit Discovery and Lead Identification (1-2 years). Once a target is validated, researchers screen compound libraries to find "hits" — molecules that interact with the target. Traditional high-throughput screening (HTS) can physically test 100,000 to 1 million compounds per campaign, but this covers only a fraction of the estimated 10⁶⁰ possible drug-like molecules in chemical space.

Stage 3: Lead Optimisation (1-2 years). Hit compounds are iteratively modified to improve potency, selectivity, metabolic stability, and safety. This is a cycle of synthesis, testing, analysis, and redesign — often requiring hundreds of compound iterations.

Stage 4: Preclinical Development (1-2 years). The optimised lead undergoes ADMET profiling (Absorption, Distribution, Metabolism, Excretion, Toxicity), formulation development, and animal testing to establish safety and efficacy before human trials.

Stage 5: Clinical Trials (6-10 years). Phase I (safety in healthy volunteers), Phase II (efficacy in patients), and Phase III (large-scale efficacy and safety) trials. This is where most drugs fail — approximately 90% of candidates entering Phase I never reach approval.

"The pharmaceutical industry spends $2.6 billion per approved drug. AI doesn't just reduce that number — it changes the entire equation."

How Does AI Transform Target Identification?

Target identification is where AI delivers its first — and perhaps most profound — intervention. Traditional target discovery relies on painstaking manual analysis of biological literature, genetic studies, and experimental validation. AI transforms this into a systematic, data-driven process.

Multi-omics analysis with deep learning models can process genomics, transcriptomics, proteomics, and metabolomics data simultaneously to identify disease-associated targets with far greater precision than any single-modality approach. Frameworks such as Cell2Sentence represent single-cell gene-expression profiles as text-like sequences, letting language models reason over cellular behaviour and surface novel targets that manual review might otherwise miss.

Knowledge graph mining extracts relationships from millions of scientific publications, patent filings, and clinical databases. Graph neural networks can traverse these knowledge graphs to identify non-obvious connections between genes, proteins, pathways, and diseases — surfacing targets that aren't visible through traditional literature review.
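As a toy illustration of the traversal idea, the sketch below builds a tiny heterogeneous graph with networkx and ranks candidate genes by how directly they connect to a disease node. The entities, relations, and the use of plain shortest paths are illustrative assumptions; production systems run graph neural networks over millions of curated relationships.

```python
# Toy knowledge-graph traversal: find indirect gene-disease links.
# Entities and relations are illustrative placeholders, not curated biology.
import networkx as nx

kg = nx.Graph()
kg.add_edge("GeneA", "Protein1", relation="encodes")
kg.add_edge("Protein1", "PathwayX", relation="participates_in")
kg.add_edge("PathwayX", "Fibrosis", relation="implicated_in")
kg.add_edge("GeneB", "Protein2", relation="encodes")
kg.add_edge("Protein2", "Fibrosis", relation="associated_with")

# Rank candidate genes by how directly they connect to the disease node.
disease = "Fibrosis"
candidates = [n for n in kg.nodes if n.startswith("Gene")]
for gene in candidates:
    path = nx.shortest_path(kg, gene, disease)
    print(f"{gene}: {' -> '.join(path)} (distance {len(path) - 1})")
```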

Causal inference models go beyond correlation to identify targets with genuine causal relationships to disease progression. This dramatically reduces the risk of pursuing targets that show association but not causation — a primary reason for late-stage clinical failure.

The result: target identification that previously took 2-3 years can be compressed to 2-3 months, with higher confidence in the biological validity of each target.

What Is Virtual Screening and Why Does It Matter?

Virtual screening is the computational equivalent of high-throughput screening — but operating at a scale that physical screening cannot touch. Instead of testing hundreds of thousands of compounds in a lab, virtual screening evaluates millions to billions of compounds computationally, predicting which molecules are most likely to bind a target effectively.

There are two primary approaches:

Structure-based virtual screening uses 3D models of the target protein (often generated by AlphaFold or similar structure prediction tools) to simulate how candidate molecules dock into the binding site. Physics-based scoring functions estimate binding affinity, and machine learning models trained on known binding data refine these predictions. Modern GPU-accelerated platforms can screen 1 billion compounds in days.

Ligand-based virtual screening works when the target structure is unknown or unreliable. Instead of modelling the target, it analyses known active compounds and uses similarity searching, pharmacophore modelling, or QSAR (Quantitative Structure-Activity Relationship) models to identify new candidates with similar properties.
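A minimal version of the ligand-based route can be sketched with RDKit: encode a known active and a small candidate library as Morgan fingerprints, then rank candidates by Tanimoto similarity to the active. The SMILES strings below are placeholders rather than real screening compounds.

```python
# Ligand-based virtual screening sketch: rank candidates by similarity to a known active.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

known_active = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # placeholder active (aspirin)
library = {
    "cand_1": "CC(=O)Oc1ccccc1C(=O)OC",   # close analogue
    "cand_2": "c1ccccc1",                 # benzene, poor match
    "cand_3": "CC(=O)Nc1ccc(O)cc1",       # paracetamol-like
}

ref_fp = AllChem.GetMorganFingerprintAsBitVect(known_active, 2, nBits=2048)

scores = []
for name, smiles in library.items():
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    scores.append((name, DataStructs.TanimotoSimilarity(ref_fp, fp)))

# Highest-similarity candidates go on to docking or wet-lab confirmation.
for name, score in sorted(scores, key=lambda x: x[1], reverse=True):
    print(f"{name}: Tanimoto = {score:.2f}")
```

Real campaigns apply the same comparison to libraries of millions to billions of compounds, and the top-ranked candidates are then handed to structure-based docking or wet-lab confirmation.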

The economics are staggering. A traditional HTS campaign costs $1-5 million and tests a limited chemical space. Virtual screening costs a fraction of that and explores chemical spaces orders of magnitude larger. More importantly, it generates ranked hit lists that are enriched for genuine actives — meaning fewer false positives in the wet lab.

In silico modelling capabilities are central to this transformation. The ability to build accurate computational representations of biological systems — from protein structures to cellular pathways — is the foundation upon which virtual screening delivers its value.

How Does De Novo Drug Design Create Molecules That Never Existed?

De novo drug design represents the most radical departure from traditional drug discovery. Instead of searching existing compound libraries, generative AI models create entirely new molecular structures optimised for specific targets and properties.

The technology draws on several AI architectures:

Variational autoencoders (VAEs) learn a continuous latent representation of molecular space, enabling smooth interpolation between known compounds and generation of novel structures in unexplored regions of chemical space.

Generative adversarial networks (GANs) pit a generator against a discriminator to produce increasingly realistic and drug-like molecules. The discriminator, trained on real compounds, steers the generator towards structures that are synthetically plausible and pharmacologically relevant.

Reinforcement learning agents explore chemical space with reward functions defined by desired properties — binding affinity, selectivity, solubility, synthesisability. The agent learns to generate molecules that simultaneously optimise multiple objectives, solving the multi-parameter optimisation problem that makes traditional medicinal chemistry so difficult. A minimal reward of this kind is sketched after this list.

Diffusion models — the same architecture behind image generation breakthroughs — have been adapted for 3D molecular generation. These models can generate molecules directly in 3D space, respecting the geometric constraints of target binding sites.
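To make the multi-objective reward idea concrete, here is a minimal scoring function of the kind a reinforcement learning designer might be trained against, combining drug-likeness (QED) with a crude lipophilicity penalty from RDKit. The weights and thresholds are illustrative assumptions; a production reward would also include predicted binding affinity, selectivity, and synthesisability terms.

```python
# Minimal multi-objective reward for a generative/RL molecule designer.
# Weights and thresholds are illustrative, not tuned values.
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

def reward(smiles: str) -> float:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                      # invalid molecules earn no reward
        return 0.0
    drug_likeness = QED.qed(mol)         # 0..1, higher is more drug-like
    logp = Descriptors.MolLogP(mol)      # crude lipophilicity/solubility proxy
    logp_penalty = max(0.0, abs(logp - 2.5) - 1.5)  # prefer logP roughly in 1..4
    return 0.7 * drug_likeness - 0.3 * logp_penalty

for s in ["CC(=O)Oc1ccccc1C(=O)O", "CCCCCCCCCCCCCCCC", "not_a_smiles"]:
    print(s, round(reward(s), 3))
```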

Insilico Medicine demonstrated the power of this approach by using their generative platform to design a novel inhibitor for a fibrosis target, moving from target to preclinical candidate in just 18 months — a process that traditionally takes 4-5 years. The molecule they generated had never existed in any chemical database.

What Is ADMET Prediction and Why Do Most Drugs Fail Without It?

ADMET — Absorption, Distribution, Metabolism, Excretion, and Toxicity — is where promising drug candidates go to die. A molecule can bind its target perfectly but fail because it's not absorbed in the gut, is metabolised too quickly by the liver, doesn't reach the target tissue, or causes unacceptable toxicity.

Traditionally, ADMET properties are determined through extensive in vitro and in vivo testing during preclinical development. This is expensive, slow, and often reveals deal-breaking issues only after years of optimisation work.

AI-powered ADMET prediction flips this paradigm. Machine learning models trained on large datasets of known ADMET measurements can predict these properties from molecular structure alone — before a molecule is ever synthesised.

By front-loading ADMET prediction into the design phase, AI enables researchers to eliminate problematic candidates before investing in synthesis and testing. This alone can reduce preclinical attrition by 30-50%, saving millions of dollars and years of wasted effort per programme.
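A stripped-down sketch of the approach: featurise molecules with a handful of RDKit descriptors and fit a scikit-learn regressor against measured values. The tiny in-memory dataset and the solubility endpoint are placeholders standing in for the large proprietary assay datasets such models are actually trained on.

```python
# ADMET-style property prediction from structure alone (illustrative sketch).
# The SMILES/label pairs are placeholders standing in for real assay data.
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.ensemble import RandomForestRegressor

def featurise(smiles: str) -> list[float]:
    mol = Chem.MolFromSmiles(smiles)
    return [
        Descriptors.MolWt(mol),
        Descriptors.MolLogP(mol),
        Descriptors.TPSA(mol),
        Descriptors.NumHDonors(mol),
        Descriptors.NumHAcceptors(mol),
    ]

train_smiles = ["CCO", "CC(=O)Oc1ccccc1C(=O)O", "CC(=O)Nc1ccc(O)cc1", "c1ccccc1O"]
train_solubility = [-0.3, -2.2, -1.0, -0.7]   # placeholder logS-style values

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(np.array([featurise(s) for s in train_smiles]), train_solubility)

# Score a never-synthesised candidate before committing lab resources.
candidate = "CC(=O)Oc1ccccc1C(=O)OC"
print("Predicted solubility:", model.predict(np.array([featurise(candidate)]))[0])
```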

How Is AI Optimising Clinical Trials?

Clinical trials consume 60-70% of the total drug development budget and timeline. They also account for the majority of late-stage failures. AI is intervening at every level:

Patient stratification and enrichment. Machine learning models analyse genomic, proteomic, and clinical data to identify patient subpopulations most likely to respond to a therapy. Enriched trials have higher signal-to-noise ratios, smaller required sample sizes, and faster readouts. This is particularly powerful for oncology, where tumour heterogeneity means the "average" patient doesn't exist. A minimal stratification sketch appears after this list.

Biomarker discovery. AI identifies predictive biomarkers from multi-omics data — molecular signatures that predict treatment response, disease progression, or adverse events. These biomarkers enable precision medicine approaches and adaptive trial designs that adjust dosing or patient selection based on real-time data.

Trial design optimisation. AI models simulate trial outcomes under different designs — sample sizes, endpoints, randomisation strategies, interim analyses — to identify the design most likely to succeed. Bayesian adaptive designs, powered by AI, can reduce trial duration by 30-40% while maintaining statistical rigour.

Site selection and patient recruitment. Natural language processing analyses electronic health records and patient registries to identify optimal trial sites and eligible patients. This addresses one of the biggest operational bottlenecks: the average clinical trial takes 30% longer than planned due to slow recruitment.

Real-world evidence integration. AI extracts insights from real-world data sources — electronic health records, claims data, wearable devices — to supplement traditional trial data, support regulatory submissions, and monitor post-market safety.
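As a minimal sketch of the stratification step, the code below clusters patients on a few omics-derived features and compares response rates per cluster. The synthetic data and the choice of k-means are illustrative assumptions; real pipelines use high-dimensional genomic and clinical features and far more careful statistics.

```python
# Patient stratification sketch: cluster patients, then inspect response per cluster.
# All numbers are synthetic placeholders, not trial data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_patients = 200
features = rng.normal(size=(n_patients, 5))          # e.g. expression of 5 biomarker genes
responded = (features[:, 0] + rng.normal(scale=0.5, size=n_patients)) > 0.5

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(features)
)

for c in range(3):
    mask = clusters == c
    print(f"cluster {c}: n={mask.sum():3d}, response rate={responded[mask].mean():.2f}")
# An enriched trial would preferentially recruit from the high-response cluster.
```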

"The question isn't whether AI will transform drug discovery. It's whether your organisation will be the one wielding it — or the one competing against those who do."

Which Companies Are Leading AI-Driven Drug Discovery?

The AI drug discovery landscape has matured rapidly, with several companies demonstrating clinical-stage results:

Recursion Pharmaceuticals has built one of the largest biological datasets in existence — over 19 petabytes of cellular imaging data — and uses it to train models that map the relationships between genes, compounds, and diseases at cellular resolution. Their platform has generated multiple clinical-stage programmes.

Insilico Medicine demonstrated the full AI-native pipeline by moving from target identification to Phase II clinical trials for its lead fibrosis programme in record time. Their Chemistry42 platform generates novel molecules, while their PandaOmics platform handles target discovery using multi-omics data.

Isomorphic Labs, DeepMind's drug discovery spinout, leverages AlphaFold's protein structure prediction capabilities for structure-based drug design. Their partnership agreements with Eli Lilly and Novartis — worth up to $3 billion — signal pharma's conviction in AI-native approaches.

Exscientia became the first company to put an AI-designed molecule into clinical trials and has demonstrated that AI can reduce hit-to-candidate timelines from 4.5 years to under 12 months.

These companies share a common thread: they don't use AI as a bolt-on to traditional processes. They've rebuilt the drug discovery pipeline from the ground up with AI at its core.

How Does NovaGenAI Approach AI-Driven Drug Discovery?

NovaGenAI occupies a distinctive position in this landscape. We don't compete with pharma companies on their own drug programmes. Instead, we build the custom AI infrastructure that enables pharmaceutical and biotech organisations to run AI-native drug discovery on their own proprietary data.

Our approach centres on three pillars:

Custom models on proprietary biological data. Generic AI models trained on public datasets deliver generic results. The competitive advantage in drug discovery comes from proprietary data — unique assay results, proprietary compound libraries, internal clinical data. We build bespoke in silico models fine-tuned on each client's specific biological data for their specific drug targets.

Multi-omics integration. Drug targets don't exist in isolation. Understanding a target requires integrating genomic, transcriptomic, proteomic, and metabolomic data into unified predictive models. Our platform handles the data engineering, normalisation, and model architecture required to make multi-omics AI actionable for drug discovery.

On-premise deployment with full data sovereignty. Pharmaceutical companies' most valuable asset is their proprietary biological data. Sending it to external cloud services is a non-starter for most serious drug programmes. We deploy AI infrastructure on-premise, ensuring that proprietary compound data, clinical results, and competitive intelligence never leave the client's environment.

The result is not a SaaS product or a generic platform. It's a purpose-built AI system that becomes an integrated part of each client's drug discovery operation — delivering custom predictions on custom data for custom targets.

What Does the Future of AI Drug Discovery Look Like?

The convergence of several trends is accelerating the transformation:

Foundation models for biology — large-scale models pre-trained on vast biological datasets, then fine-tuned for specific tasks — are achieving the same paradigm shift that GPT brought to language. Models like ESM (Evolutionary Scale Modeling) for proteins and Cell2Sentence for cellular data represent a new generation of biological AI that understands life at the molecular level.

Closed-loop autonomous labs combine AI-driven hypothesis generation with robotic experimental execution. The AI designs experiments, robots execute them, results feed back into the model, and the cycle repeats — 24/7, without human bottlenecks. Recursion and others are already operating these systems at scale.

Quantum-accelerated molecular simulation promises to unlock accurate modelling of molecular interactions that are computationally intractable on classical hardware. While still nascent, this will eventually enable first-principles drug design without empirical approximations.

The trajectory is clear: the companies that build AI-native drug discovery infrastructure today will define the pharmaceutical landscape of the next decade. The $2.6 billion, 15-year paradigm is ending. What replaces it will be faster, cheaper, and — most importantly — more likely to produce drugs that actually work.

The question for pharmaceutical and biotech organisations isn't whether to adopt AI. It's whether to build the capability now — or scramble to catch up when competitors are already in the clinic.

Frequently Asked Questions

How long does traditional drug discovery take?
Traditional drug discovery takes 10-15 years on average from initial target identification to regulatory approval. The process costs approximately $2.6 billion per approved drug, with a failure rate exceeding 90%. Most candidates fail in late-stage clinical trials, making the economics brutally inefficient.

How does AI accelerate drug discovery?
AI compresses drug discovery by automating target identification through genomic analysis, screening millions of compounds computationally in days, generating novel molecules via de novo design, predicting ADMET properties before synthesis, and optimising clinical trial design. Companies have demonstrated target-to-clinical-candidate timelines of under 18 months.

What is virtual screening?
Virtual screening uses computational models to evaluate millions or billions of chemical compounds against a biological target, predicting which molecules are most likely to bind effectively. It replaces traditional high-throughput screening, which is limited to testing hundreds of thousands of compounds physically at a cost of millions of dollars per campaign.

What is de novo drug design?
De novo drug design uses generative AI models — including variational autoencoders, GANs, reinforcement learning, and diffusion models — to create entirely new molecular structures optimised for specific biological targets. These molecules have never existed in any chemical database and are designed to simultaneously satisfy multiple drug-like properties.

How does NovaGenAI approach AI-driven drug discovery?
NovaGenAI builds custom AI models trained on proprietary biological data for specific drug targets. Our approach combines in silico modelling, multi-omics data integration, and foundation model fine-tuning, deployed on-premise for full data sovereignty. We don't compete with pharma — we build the AI infrastructure that powers their discovery programmes.

Related Articles

In Silico Modelling: How AI Simulates Biology Before the Lab
Computational Biotech · Feb 28, 2026 · 9 min

What Is Cell2Sentence? Translating Biology Into Language
Computational Biotech · Feb 28, 2026 · 8 min