Computational Biotech

Custom AI Models for Biology: How NovaGenAI Is Making Biological Data Machine-Readable

Don Calaki · 28 February 2026 · 9 min read

Biology has a language. It's written in nucleotide sequences, expressed through protein structures, regulated by epigenetic marks, and manifested in the transcriptomic profile of every living cell.

For decades, the scientists who study this language relied on statistical models, manual annotation, and domain intuition to extract meaning from biological data. That era is ending.

At NovaGenAI, we build custom AI models that make biology machine-readable — not through traditional bioinformatics pipelines, but through the same class of large language models that have transformed how machines process human language.

How Can Large Language Models Process Biological Data?

Large language models are sequence-processing engines. They ingest ordered tokens, learn statistical relationships between them, and generate predictions. This architecture was designed for human language, but language isn't the only system that can be expressed as meaningful sequences.

A single human cell contains a transcriptomic profile — a quantitative record of which genes are active and at what levels. This profile can include expression data for over 20,000 genes. It defines the cell's identity, its state, its trajectory, and its potential.

Pioneering research such as Cell2Sentence, which originated at Yale and was later scaled up in collaboration with Google, has demonstrated that multi-dimensional biological data can be converted into structured sequences that large language models can ingest, learn from, and reason about. We're building on these foundations by training custom models on proprietary biological datasets to solve specific enterprise problems.

The conversion isn't trivial. Raw gene expression data is high-dimensional, sparse, and noisy. Our models apply transformations that preserve biological relationships while producing sequences with the statistical properties that make LLM training effective.
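To make the idea concrete, here is a minimal sketch of the rank-order encoding popularised by Cell2Sentence: genes are sorted by expression level and the cell is written out as a "sentence" of gene names. This is an illustration only, not a description of our production transformations. The function name `cell_to_sentence`, the `top_k` cut-off, and the toy expression values are all assumptions made for the example.

```python
from typing import Dict


def cell_to_sentence(expression: Dict[str, float], top_k: int = 100) -> str:
    """Encode one cell as gene names ordered from highest to lowest expression.

    Hypothetical helper for illustration; real pipelines also normalise counts
    and handle batch effects before ranking.
    """
    # Single-cell data is sparse: genes with zero counts carry no rank
    # information, so they are dropped before sorting.
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]

    # Sort by descending expression; ties are broken alphabetically so the
    # output is reproducible.
    expressed.sort(key=lambda item: (-item[1], item[0]))

    # Keep only the top_k most expressed genes to bound the sequence length.
    return " ".join(gene for gene, _ in expressed[:top_k])


# Toy expression profile (made-up values) for a single cell.
cell = {"GAPDH": 120.0, "ACTB": 95.0, "SOX2": 14.0, "NANOG": 0.0, "POU5F1": 14.0}
sentence = cell_to_sentence(cell, top_k=4)
print(sentence)  # GAPDH ACTB POU5F1 SOX2
```

The resulting string can be tokenised like ordinary text. Rank ordering discards absolute expression values but keeps relative ordering, which is one simple way to tame the noise and sparsity described above.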

The result: AI models that have learned to "read" cells the way GPT learned to read English.

What Is the Difference Between Single-Omics and Multi-Omics AI?

Our first-generation models focus on transcriptomics — gene expression data from single-cell RNA sequencing. This alone opens significant capabilities: cell type classification, state prediction, and perturbation response modelling.

But a cell's identity isn't defined by transcriptomics alone. We're extending our models to integrate multiple omics layers, including genomics, epigenomics, proteomics, and metabolomics.

The relationships between layers — how epigenetic state influences transcription, how transcription drives protein expression — are precisely the kind of complex, contextual dependencies that transformer architectures excel at modelling.

A model that understands only transcriptomics can tell you what a cell is doing. A model that understands the full multi-omics landscape can tell you why — and predict what it will do next.

Why Does NovaGenAI Focus on Stem Cell Biology?

Our primary data domain is stem cell biology, and this choice is deliberate.

Stem cells are uniquely information-rich. They exist in dynamic states — proliferating, differentiating, responding to signals, making fate decisions. A single stem cell culture can produce thousands of distinct cellular states.

This dynamism creates an extraordinarily rich training corpus for language models. Where a static tissue sample gives you a snapshot, stem cell data gives you narratives — trajectories of cellular change that the model can learn to predict.

The practical applications are immediate:

Cord blood and tissue banking. Understanding stem cell quality, potency, and differentiation potential at the molecular level — through AI rather than slow, expensive functional assays — transforms how these materials are characterised, stored, and matched to therapeutic applications.

Regenerative medicine. Predicting differentiation outcomes from early molecular signatures means identifying which culture conditions will produce the desired result before committing weeks of laboratory time.

Disease modelling. Patient-derived stem cells can be profiled and their responses to perturbations predicted in silico — computationally — before a single compound is pipetted.

How Can AI Models Accelerate Drug Discovery?

The pharmaceutical industry spends an average of $2.6 billion and 10 to 15 years to bring a single drug to market. Most of that cost comes from candidates that fail along the way.

Our models attack this failure rate at its root.

A model that has learned the language of cellular biology can predict how cells will respond to novel compounds without running physical experiments. It can screen millions of potential drug candidates in hours rather than months. It can identify off-target effects by modelling the full downstream biological cascade.

This isn't a replacement for wet lab biology. No computational model eliminates the need for physical validation. But it's a radical acceleration of the discovery funnel — entering the laboratory with higher-confidence candidates and fewer dead ends.

If our models reduce the failure rate of preclinical candidates by even a modest percentage, the savings per successful drug are measured in hundreds of millions of dollars.
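A back-of-the-envelope calculation shows how a small change in success rate translates into savings of that size. The sketch below assumes a deliberately simplified model in which the cost per approved drug equals the cost per candidate divided by the probability that a candidate succeeds. The $2.6 billion figure comes from the paragraph above; the 10% and 11% success rates are hypothetical values chosen only to illustrate the arithmetic.

```python
def cost_per_approval(cost_per_candidate: float, success_rate: float) -> float:
    """Expected spend per approved drug under a simple 'cost / success rate' model."""
    return cost_per_candidate / success_rate


BASELINE_COST = 2.6e9  # average cost per approved drug quoted in the article
baseline_rate = 0.10   # hypothetical baseline success rate (assumption)
improved_rate = 0.11   # hypothetical rate after better candidate selection (assumption)

# Back out the implied cost of a single candidate from the baseline figures.
cost_per_candidate = BASELINE_COST * baseline_rate

# Re-compute the expected cost per approval at the improved success rate.
improved_cost = cost_per_approval(cost_per_candidate, improved_rate)
saving = BASELINE_COST - improved_cost

print(f"Saving per approved drug: ${saving / 1e6:.0f} million")
```

Under these assumptions, a one-percentage-point improvement saves roughly $236 million per approved drug. Real development costs are not spread evenly across stages, so this is an order-of-magnitude illustration rather than a forecast.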

What Infrastructure Powers Biological AI at NovaGenAI?

The same hardware that powers frontier language models — GPU clusters, high-bandwidth memory architectures, distributed training frameworks — is now available for biological sequence modelling. We train our models on NVIDIA DGX Spark infrastructure, leveraging the same transformer architectures that power the most capable language models in the world.

NovaGenAI operates at the intersection of two worlds. We have deep partnerships with NVIDIA for compute infrastructure, with Anthropic and OpenAI for foundational model research, and with Google Cloud for scalable data processing. We also work directly with stem cell laboratories and biobanks, giving us access to high-quality, clinically annotated biological data.

This dual positioning — feet in both the AI infrastructure world and the biological research world — is the entire thesis of the company.

What Are the Limitations of Biological AI Models?

Scientific integrity demands precision.

Our models are not diagnostic tools. They don't make clinical decisions. They're not replacements for clinical trials, regulatory approval, or physician expertise.

They are research platforms — new ways of representing and reasoning about biological data that accelerate discovery and improve prediction accuracy. Their outputs are hypotheses to be tested, not conclusions to act on without validation.

What Is NovaGenAI's Roadmap for Biological AI?

Our current models demonstrate strong performance on cell type classification, perturbation prediction, and cross-tissue generalisation. Over the next 12 months, we will:

We're building at the frontier of two fields simultaneously — and that's exactly where the most important work happens.


NovaGenAI is a computational biotech and enterprise AI company operating across Malaysia, Singapore, and Australia. We build custom AI models for biological data, enterprise voice agents, and production AI systems.

To learn more about our computational biotech capabilities or partnership opportunities, visit our contact page or email us at enquiries@novagenai.com.my.

Frequently Asked Questions

How can large language models process biological data?

LLMs are sequence-processing engines. Biological data like gene expression profiles can be converted into structured sequences that models can learn from. Cell2Sentence, which originated at Yale and was later scaled up in collaboration with Google, demonstrated this approach; NovaGenAI trains custom models on proprietary biological datasets using similar principles.

What is the difference between single-omics and multi-omics AI?

Single-omics models focus on one data type like transcriptomics, enabling cell classification and state prediction. Multi-omics models integrate genomics, epigenomics, proteomics, and metabolomics to understand why cells behave certain ways and predict future behaviour.

How can AI models accelerate drug discovery?

AI models that understand cellular biology can predict compound responses computationally, screen millions of candidates in hours, and identify off-target effects — radically accelerating the discovery funnel before entering the wet lab.

Does NovaGenAI's biological AI replace clinical trials?

No. Our models are research platforms that generate hypotheses for testing. They are not diagnostic tools, and they don't replace clinical trials, regulatory approval, or physician expertise.
