Biology has a language. It's written in nucleotide sequences, expressed through protein structures, regulated by epigenetic marks, and manifested in the transcriptomic profile of every living cell.
For decades, the scientists who study this language relied on statistical models, manual annotation, and domain intuition to extract meaning from biological data. That era is ending.
At NovaGenAI, we build custom AI models that make biology machine-readable — not through traditional bioinformatics pipelines, but through the same class of large language models that have transformed how machines process human language.
How Can Large Language Models Process Biological Data?
Large language models are sequence-processing engines. They ingest ordered tokens, learn statistical relationships between them, and generate predictions. This architecture was designed for human language, but language isn't the only system that can be expressed as meaningful sequences.
A single human cell contains a transcriptomic profile — a quantitative record of which genes are active and at what levels. This profile can include expression data for over 20,000 genes. It defines the cell's identity, its state, its trajectory, and its potential.
Pioneering research such as Cell2Sentence, developed at Yale and later scaled in collaboration with Google, has demonstrated that high-dimensional biological data can be converted into structured sequences that large language models can ingest, learn from, and reason about. We're building on these foundations by training custom models on proprietary biological datasets to solve specific enterprise problems.
The conversion isn't trivial. Raw gene expression data is high-dimensional, sparse, and noisy. Our models apply transformations that preserve biological relationships while producing sequences with the statistical properties that make LLM training effective.
The result: AI models that have learned to "read" cells the way GPT learned to read English.
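To make the idea of turning a cell into a sequence concrete, here is a minimal sketch of one publicly described approach, the rank-ordering used by Cell2Sentence: genes are sorted by expression level and their names are written out as a "sentence". This is an illustration only, not NovaGenAI's proprietary transformation. The function name `cell_to_sentence`, the `top_k` and `target_sum` parameters, and the toy counts are all assumptions made for the example.

```python
import numpy as np


def cell_to_sentence(counts, gene_names, top_k=100, target_sum=1e4):
    """Turn one cell's expression counts into a rank-ordered gene 'sentence'.

    Hypothetical helper for illustration; not a real library API.
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total == 0:
        return ""  # no detected transcripts, so there is nothing to write

    # Library-size normalisation followed by log1p, a common scRNA-seq step.
    # Within a single cell this is monotonic and does not change the ranking;
    # it matters when expression values are compared across cells.
    normalized = np.log1p(counts / total * target_sum)

    # Sort genes from highest to lowest expression. A stable sort keeps the
    # original gene order for ties, so the output is deterministic.
    order = np.argsort(-normalized, kind="stable")

    # Drop unexpressed genes (this handles sparsity) and keep the top_k.
    expressed = [i for i in order if normalized[i] > 0][:top_k]
    return " ".join(gene_names[i] for i in expressed)


# Toy example with made-up counts for five well-known genes.
genes = ["GAPDH", "ACTB", "SOX2", "NANOG", "POU5F1"]
counts = [12, 0, 30, 3, 30]
sentence = cell_to_sentence(counts, genes, top_k=4)
print(sentence)
```

The resulting string can be tokenised like ordinary text. Position in the sentence carries the expression information, which is why a sequence model can learn from it. A production pipeline would also need quality control, batch handling, and a consistent gene vocabulary, none of which is shown here.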
What Is the Difference Between Single-Omics and Multi-Omics AI?
Our first-generation models focus on transcriptomics — gene expression data from single-cell RNA sequencing. This alone opens significant capabilities: cell type classification, state prediction, and perturbation response modelling.
But a cell's identity isn't defined by transcriptomics alone. We're extending our models to integrate multiple omics layers:
- Genomics — DNA sequences and structural variants
- Epigenomics — chemical modifications regulating gene activity
- Proteomics — the protein expression landscape
- Metabolomics — small molecule profiles reflecting biochemical activity
The relationships between layers — how epigenetic state influences transcription, how transcription drives protein expression — are precisely the kind of complex, contextual dependencies that transformer architectures excel at modelling.
A model that understands only transcriptomics can tell you what a cell is doing. A model that understands the full multi-omics landscape can tell you why — and predict what it will do next.
Why Does NovaGenAI Focus on Stem Cell Biology?
Our primary data domain is stem cell biology, and this choice is deliberate.
Stem cells are uniquely information-rich. They exist in dynamic states — proliferating, differentiating, responding to signals, making fate decisions. A single stem cell culture can produce thousands of distinct cellular states.
This dynamism creates an extraordinarily rich training corpus for language models. Where a static tissue sample gives you a snapshot, stem cell data gives you narratives — trajectories of cellular change that the model can learn to predict.
The practical applications are immediate:
Cord blood and tissue banking. Understanding stem cell quality, potency, and differentiation potential at the molecular level — through AI rather than slow, expensive functional assays — transforms how these materials are characterised, stored, and matched to therapeutic applications.
Regenerative medicine. Predicting differentiation outcomes from early molecular signatures means identifying which culture conditions will produce the desired result before committing weeks of laboratory time.
Disease modelling. Patient-derived stem cells can be profiled and their responses to perturbations predicted computationally (in silico) before a single compound is pipetted.
How Can AI Models Accelerate Drug Discovery?
By one widely cited estimate, the pharmaceutical industry spends an average of $2.6 billion and 10 to 15 years to bring a single drug to market. Most of that cost comes from candidates that fail along the way.
Our models attack this failure rate at its root.
A model that has learned the language of cellular biology can predict how cells will respond to novel compounds without running physical experiments. It can screen millions of potential drug candidates in hours rather than months. It can identify off-target effects by modelling the full downstream biological cascade.
This isn't a replacement for wet lab biology. No computational model eliminates the need for physical validation. But it's a radical acceleration of the discovery funnel — entering the laboratory with higher-confidence candidates and fewer dead ends.
If our models reduce the failure rate of preclinical candidates by even a modest percentage, the savings per successful drug could reach hundreds of millions of dollars.
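A back-of-envelope sketch of that arithmetic, under a deliberately simple assumption: the cost per approved drug is treated as total spend divided by the probability that a candidate succeeds, so a relative gain in the success rate lowers the cost per approval proportionally. The $2.6 billion baseline comes from the text above. The gain values and the function names are hypothetical, and the model ignores stage-specific costs and the cost of capital.

```python
def cost_per_approval(baseline_cost: float, relative_success_gain: float) -> float:
    """Expected cost per approved drug after a relative gain in success rate.

    Simplifying assumption: cost per approval is inversely proportional to
    the overall probability that a candidate reaches approval.
    """
    if relative_success_gain <= -1.0:
        raise ValueError("relative_success_gain must be greater than -1")
    return baseline_cost / (1.0 + relative_success_gain)


def savings_per_approval(baseline_cost: float, relative_success_gain: float) -> float:
    """Difference between the baseline and the improved cost per approval."""
    return baseline_cost - cost_per_approval(baseline_cost, relative_success_gain)


BASELINE_COST = 2.6e9  # USD per approved drug, the figure quoted in the text

# Illustrative relative gains in success rate; these are not measured results.
for gain in (0.05, 0.10, 0.20):
    saved = savings_per_approval(BASELINE_COST, gain)
    print(f"{gain:.0%} relative gain in success rate: "
          f"about ${saved / 1e6:.0f}M saved per approval")
```

Under these assumptions, a 10 percent relative improvement in success rate corresponds to roughly $236 million per approved drug. The real figure depends on where in the pipeline failures are avoided, since late-stage failures cost far more than early ones.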
What Infrastructure Powers Biological AI at NovaGenAI?
The same hardware that powers frontier language models — GPU clusters, high-bandwidth memory architectures, distributed training frameworks — is now available for biological sequence modelling. We train our models on NVIDIA DGX Spark infrastructure, leveraging the same transformer architectures that power the most capable language models in the world.
NovaGenAI operates at the intersection of two worlds. We have deep partnerships with NVIDIA for compute infrastructure, with Anthropic and OpenAI for foundational model research, and with Google Cloud for scalable data processing. We also work directly with stem cell laboratories and biobanks, giving us access to high-quality, clinically annotated biological data.
This dual positioning — feet in both the AI infrastructure world and the biological research world — is the entire thesis of the company.
What Are the Limitations of Biological AI Models?
Scientific integrity demands precision.
Our models are not diagnostic tools. They don't make clinical decisions. They're not replacements for clinical trials, regulatory approval, or physician expertise.
They are research platforms — new ways of representing and reasoning about biological data that accelerate discovery and improve prediction accuracy. Their outputs are hypotheses to be tested, not conclusions to act on without validation.
What Is NovaGenAI's Roadmap for Biological AI?
Our current models demonstrate strong performance on cell type classification, perturbation prediction, and cross-tissue generalisation. Over the next 12 months, we will:
- Scale training to larger and more diverse multi-omics datasets
- Publish benchmark results against established computational biology methods
- Deploy inference capabilities to partner laboratories for validation
- Extend the framework to additional biological domains
We're building at the frontier of two fields simultaneously — and that's exactly where the most important work happens.
NovaGenAI is a computational biotech and enterprise AI company operating across Malaysia, Singapore, and Australia. We build custom AI models for biological data, enterprise voice agents, and production AI systems.
To learn more about our computational biotech capabilities or partnership opportunities, visit our contact page or email us at enquiries@novagenai.com.my.