What is Cell2Sentence? Google's Biology-as-Language Breakthrough Explained

Don Calaki · 9 min read

In 2023, a team led by researchers at Yale University published a paper that changed how artificial intelligence could be applied to biology; later, scaled-up versions of the work were built in collaboration with Google. Cell2Sentence demonstrated that single-cell gene expression data (the molecular fingerprint of an individual cell) could be converted into sentences of gene names and processed by the same kind of large language models that power ChatGPT and Google's Gemini. The implication is far-reaching: biology, at the level of individual cells, can be treated as a language problem.

This article is the definitive guide to Cell2Sentence — what it is, how it works, why it matters, and how companies like NovaGenAI are building on these foundational concepts to deploy commercial AI systems trained on proprietary biological data.

What is Cell2Sentence?

Cell2Sentence is an academic research framework, originally developed at Yale University and later scaled up in collaboration with Google, that transforms single-cell RNA sequencing (scRNA-seq) data into natural language representations. Instead of feeding raw numerical gene expression matrices into specialised neural network architectures, Cell2Sentence converts each cell's expression profile into an ordered sentence of gene names, ranked from highest to lowest expression. This allows standard large language models (LLMs) to directly process, learn from, and reason about biological data.

The core insight is deceptively simple: a cell's identity is defined by which genes are active and how active they are. By ranking genes by expression level and writing them as an ordered sequence, you create a "sentence" that captures the essential biology of that cell. Just as word order carries meaning in English, gene order carries biological meaning in a Cell2Sentence representation.

How Does Single-Cell RNA Sequencing Work?

To understand Cell2Sentence, you first need to understand the data it transforms. Single-cell RNA sequencing (scRNA-seq) measures gene expression at the resolution of individual cells. Unlike bulk RNA sequencing — which averages expression across millions of cells and loses critical heterogeneity — scRNA-seq captures the unique molecular state of each cell.

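A typical scRNA-seq experiment produces a gene expression matrix: rows represent individual cells (often tens of thousands to millions), columns represent genes (approximately 20,000 protein-coding genes in the human genome), and each value counts the RNA transcripts detected for a given gene in that cell, a proxy for how actively that gene is being transcribed. The result is a massive, high-dimensional numerical matrix that is also sparse: most entries are zero, because any one cell has detectable transcripts for only a fraction of its genes.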
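
As a rough illustration of that matrix, the sketch below builds a tiny cells-by-genes table in plain Python. The counts are invented for illustration, and the helper names `sparsity` and `pseudo_bulk` are hypothetical. It shows both how many entries are zero and how averaging across cells, as a bulk experiment effectively does, blurs two distinct populations into one profile.

```python
# Toy cells-by-genes count matrix. All values are invented for illustration.
genes = ["CD3E", "CD4", "IL7R", "CD19", "MS4A1", "CD79A"]
counts = [
    [12, 5, 8, 0, 0, 0],   # cell 0: T-cell-like
    [9, 0, 6, 0, 0, 0],    # cell 1: T-cell-like
    [0, 0, 0, 7, 11, 4],   # cell 2: B-cell-like
    [0, 0, 0, 10, 6, 0],   # cell 3: B-cell-like
]


def sparsity(matrix):
    """Return the fraction of matrix entries that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


def pseudo_bulk(matrix):
    """Average each gene (column) across all cells, as bulk sequencing would."""
    n_cells = len(matrix)
    return [sum(column) / n_cells for column in zip(*matrix)]


print(f"{len(counts)} cells x {len(genes)} genes, sparsity {sparsity(counts):.2f}")
print(dict(zip(genes, pseudo_bulk(counts))))
```

In the averaged profile every gene appears moderately expressed, so the T-cell-like and B-cell-like populations are no longer distinguishable. The per-cell rows keep that distinction.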

This data is extraordinarily rich. It reveals cell types, developmental trajectories, disease states, drug responses, and intercellular communication networks. But it's also extraordinarily complex. Traditional bioinformatics pipelines require extensive preprocessing, dimensionality reduction, clustering, and annotation — each step introducing assumptions and potential information loss.

How Does Cell2Sentence Convert Biology into Language?

Cell2Sentence's conversion process follows a remarkably elegant pipeline:

Step 1: Gene Expression Ranking. For each individual cell, all genes are ranked by their expression level from highest to lowest. A cell with high expression of CD3E, CD4, and IL7R (T-cell markers) would have these genes appear early in its sentence. A cell with high expression of CD19, MS4A1, and CD79A (B-cell markers) would produce a completely different sentence.

Step 2: Sentence Construction. The ranked gene names are concatenated into a natural language sentence. The resulting "sentence" might look like: "MALAT1 TMSB4X B2M RPL13 RPS27 CD3D IL32 LTB CD3E LDHB..." — a sequence that, to a biologist, immediately signals a T-lymphocyte.

Step 3: LLM Processing. These biological sentences are then fed into pre-trained or fine-tuned large language models. Because LLMs are fundamentally sequence-processing machines trained to understand patterns, relationships, and context within ordered text, they can learn the "grammar" of biology — which genes co-occur, which expression patterns define cell types, and which sequences indicate disease states.
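
Steps 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not the published implementation: the function name `cell_to_sentence`, the `top_k` parameter, the choice to drop genes with zero expression, and the alphabetical tie-break are all assumptions made for the example, and the expression values are invented.

```python
def cell_to_sentence(expression, top_k=None):
    """Rank a cell's expressed genes from highest to lowest and join the names.

    expression: dict mapping gene name -> expression value for one cell.
    top_k: optional cap on the number of genes kept (illustrative parameter).
    """
    # Assumption: genes with zero expression are left out of the sentence.
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    # Sort by descending expression; ties are broken alphabetically
    # so that the output is deterministic.
    ranked = sorted(expressed, key=lambda pair: (-pair[1], pair[0]))
    names = [gene for gene, _ in ranked]
    if top_k is not None:
        names = names[:top_k]
    return " ".join(names)


# Invented expression values for a T-cell-like and a B-cell-like profile.
t_cell = {"CD3E": 12, "CD4": 5, "IL7R": 8, "CD19": 0, "MS4A1": 0, "CD79A": 0}
b_cell = {"CD3E": 0, "CD4": 0, "IL7R": 0, "CD19": 7, "MS4A1": 11, "CD79A": 4}

print(cell_to_sentence(t_cell))  # CD3E IL7R CD4
print(cell_to_sentence(b_cell))  # MS4A1 CD19 CD79A
```

The two profiles produce sentences with no words in common, which is the property Step 3 relies on when a language model learns to tell cell types apart.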

Figure: Converting molecular biology into computational language, the foundation of a new paradigm

Why is the Biology-as-Language Paradigm a Breakthrough?

The significance of Cell2Sentence extends far beyond a clever data transformation trick. It represents a paradigm shift in how computational biology operates:

Leveraging existing AI infrastructure. Billions of dollars have been invested in building, training, and optimising large language models. Cell2Sentence allows the biology community to leverage this entire infrastructure — pre-trained weights, training frameworks, inference optimisation, and scaling laws — without building domain-specific architectures from scratch.

Transfer learning from language to biology. LLMs pre-trained on text corpora have already learned general capabilities such as modelling sequence patterns, contextual relationships, and hierarchical structure. The Cell2Sentence results suggest these capabilities carry over to cell sentences, giving biology models a head start that would otherwise require enormous biological datasets to achieve.

Natural language interaction with biological data. Perhaps the most transformative implication: when biology is encoded as language, you can query it with language. Researchers could potentially ask an LLM, "What cell type does this expression profile represent?" or "Which genes distinguish this diseased cell from a healthy one?" and receive natural language answers grounded in the biological data.

Unifying multi-modal biological data. The language representation opens pathways to integrate different biological data types — genomics, transcriptomics, proteomics, clinical annotations — into a single representational framework. This is critical for multi-omics approaches that combine multiple biological data layers for deeper insights.
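
To make the natural language interaction point concrete, the sketch below assembles a text prompt from a cell sentence and a question. The prompt wording and the function name `build_annotation_prompt` are hypothetical, not the template used in the paper, and no model is called: the string would be passed to whichever LLM a team has fine-tuned on cell sentences.

```python
def build_annotation_prompt(
    cell_sentence,
    question="What cell type does this expression profile represent?",
):
    """Combine a cell sentence and a question into one prompt string.

    The template is an illustrative assumption, not a published format.
    """
    return (
        "The following gene names are ordered by descending expression "
        "in a single cell.\n"
        f"Cell sentence: {cell_sentence}\n"
        f"Question: {question}\n"
        "Answer:"
    )


# The example sentence quoted in Step 2 of the pipeline.
sentence = "MALAT1 TMSB4X B2M RPL13 RPS27 CD3D IL32 LTB CD3E LDHB"
prompt = build_annotation_prompt(sentence)
print(prompt)
```

Swapping the `question` argument for "Which genes distinguish this diseased cell from a healthy one?" reuses the same structure for the second query mentioned above.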

"Cell2Sentence proved that biology has a grammar. The next step is building systems that speak it fluently — on proprietary data, for commercial applications, at enterprise scale."

What Were the Key Results from Google's Research?

The Cell2Sentence paper demonstrated several compelling results that validated the biology-as-language approach:

Cell type classification. LLMs fine-tuned on Cell2Sentence representations achieved competitive or superior performance on cell type classification tasks compared to purpose-built single-cell analysis tools like scBERT and scGPT. This is remarkable because the LLMs were not designed for biology — they learned biological patterns through the language representation alone.

Cell sentence generation. The models could generate novel, biologically plausible cell sentences — essentially predicting what a cell's expression profile should look like given partial information. This capability has direct implications for in-silico modelling, where computational prediction replaces physical experimentation.

Natural language annotation. The framework enabled direct conversion between biological data and natural language descriptions: given a cell sentence, the model could generate a text description ("This is a CD4+ T helper cell"), and given a description, it could generate a plausible cell sentence.

Zero-shot and few-shot learning. Because the approach leverages pre-trained LLMs, it showed promising few-shot learning capabilities — performing meaningful classification with very limited biological training examples, a critical advantage when working with rare cell types or limited patient samples.

What Are the Limitations of Cell2Sentence?

Like all pioneering research, Cell2Sentence has important limitations that must be understood honestly:

Information loss through ranking. Converting continuous expression values to rank-ordered gene names discards magnitude information. Two genes measured at 10,000 and 1,000 counts keep their relative order in the sentence, but the sentence does not record that one is ten times more abundant than the other. Those absolute and relative magnitudes carry biological meaning, so their loss matters for quantitative analyses like differential expression and gene regulatory network inference.

Scalability constraints. Human cells express thousands of genes simultaneously. Cell2Sentence representations can become very long sequences, and LLMs have context window limitations. Processing 5,000+ gene "words" per cell across millions of cells presents computational challenges even for modern transformer architectures.

Academic proof-of-concept versus production deployment. Cell2Sentence was published as a research paper demonstrating feasibility. It was not designed as a production system for clinical or commercial use. The gap between academic validation on public datasets and production deployment on proprietary enterprise data is enormous — encompassing data quality, regulatory compliance, model governance, inference latency, and continuous monitoring.

Public dataset dependence. The published experiments used publicly available scRNA-seq datasets. The performance characteristics on private, proprietary datasets — which often have different quality profiles, batch effects, and biological contexts — remain unvalidated in the public literature.
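
The first two limitations can be illustrated with a short sketch. The gene names and counts are invented, and `tokens_per_gene` and `context_window` are assumed values chosen only to show the arithmetic. Real figures depend on the tokenizer and model in use.

```python
def to_sentence(expression):
    """Order expressed genes by descending value (ties alphabetical) and join them."""
    ranked = sorted(
        (gene for gene, value in expression.items() if value > 0),
        key=lambda gene: (-expression[gene], gene),
    )
    return " ".join(ranked)


# Two cells with very different magnitudes but the same ordering of genes.
cell_a = {"GENE_A": 10000, "GENE_B": 1000, "GENE_C": 10}
cell_b = {"GENE_A": 30, "GENE_B": 20, "GENE_C": 10}

# Information loss: both cells collapse to the identical sentence.
same = to_sentence(cell_a) == to_sentence(cell_b)
print(to_sentence(cell_a), "| identical sentences:", same)

# Scalability: rough token budget for one cell.
genes_per_cell = 5000   # the "5,000+ gene words" figure from the text
tokens_per_gene = 3     # assumption: a gene symbol splits into a few subword tokens
context_window = 8192   # assumption: an example context limit, not a specific model's
needed = genes_per_cell * tokens_per_gene
print(f"tokens needed: {needed}, exceeds window: {needed > context_window}")
```

Under these assumed numbers a single full-length cell sentence would not fit in the window, which is why truncating to the top-ranked genes is a natural workaround, at the cost of discarding the lower-ranked ones.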

How Does NovaGenAI Build on These Foundations?

NovaGenAI does not use Cell2Sentence. We build on the foundational insight that Cell2Sentence and related research validated: biological data can be represented and processed using language model architectures. From that starting point, we diverge entirely into proprietary territory.

Custom models on proprietary data. NovaGenAI trains custom foundation models on proprietary biological datasets provided by our enterprise clients — stem cell laboratories, biobanks, cord blood banks, and healthcare systems. This data has never been published, never been part of any public dataset, and contains biological insights that public models simply cannot capture.

Enterprise-grade training infrastructure. We leverage the full NVIDIA ecosystem for model training and deployment: NVIDIA NeMo for foundation model training, NIM for optimised inference, CUDA for GPU-accelerated computation, and TensorRT for production inference optimisation. This stack delivers the performance and reliability that enterprise deployment demands.

Beyond transcriptomics. While Cell2Sentence focused on scRNA-seq data (transcriptomics), NovaGenAI's models integrate multiple omics layers — genomic variants, epigenetic modifications, proteomic measurements, and clinical metadata. This multi-modal approach captures biological complexity that single-omics representations miss.

Production deployment for regulated industries. Our models are deployed in environments that meet healthcare regulatory requirements. This means on-premise or hybrid infrastructure, data sovereignty guarantees, audit trails, model versioning, and continuous monitoring — none of which are addressed by academic research papers.

Figure: From laboratory bench to enterprise AI, bridging the gap between research and deployment

What is the Difference Between Academic Research and Commercial Deployment?

This distinction is critical and often misunderstood. Academic research like Cell2Sentence asks: "Is this possible?" Commercial deployment asks: "Can this work reliably, at scale, on our data, within our regulatory constraints, and deliver measurable business value?"

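The gap includes data quality, regulatory compliance, model governance, inference latency, and continuous monitoring.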

NovaGenAI exists precisely to bridge this gap. We take the foundational science — proven by researchers at Google, Stanford, the Broad Institute, and others — and build production-grade systems that deliver commercial value on proprietary data.

What Does This Mean for the Future of Biology?

The biology-as-language paradigm, demonstrated by Cell2Sentence and advanced by companies building on these foundations, points toward a future where:

Every cell becomes queryable. Biological databases become conversational. Researchers and clinicians interact with cellular data through natural language, dramatically accelerating discovery and clinical decision-making.

In-silico experiments replace wet lab first passes. Before running expensive, time-consuming physical experiments, researchers can simulate biological outcomes computationally — screening millions of hypotheses in hours rather than months.

Personalised medicine becomes data-driven. Models trained on diverse biological datasets can predict individual patient responses to therapies based on their unique cellular profiles, moving medicine from population-level statistics to individual-level precision.

Drug discovery accelerates. The combination of biology-as-language models, AI-driven drug discovery, and multi-omics integration could compress timelines from decades to years, potentially saving billions in development costs and, more importantly, lives.

Cell2Sentence was the proof of concept. What comes next — built on proprietary data, deployed at enterprise scale, optimised for real-world biological problems — will define the next decade of computational biology.

Frequently Asked Questions

What is Cell2Sentence?
Cell2Sentence is a research framework, originally developed at Yale University and later scaled up in collaboration with Google, that converts single-cell RNA sequencing data into sentences of gene names. It ranks genes by expression level within each cell and represents them as ordered word sequences, enabling standard large language models to process biological data without specialised architectures.

Who developed Cell2Sentence?
Cell2Sentence was developed by academic researchers, originally at Yale University, with later scaled-up versions built in collaboration with Google, and it was published as an academic paper. It is a research contribution to computational biology, not a commercial product.

Can Cell2Sentence be used for drug discovery?
Cell2Sentence itself is a data representation method, not a drug discovery tool. However, the biology-as-language paradigm it demonstrates opens pathways for downstream applications including drug target identification, cell state prediction, and virtual screening when combined with appropriate models and proprietary datasets.

What does NovaGenAI build?
NovaGenAI builds custom foundation models trained on proprietary biological datasets from stem cell laboratories, biobanks, and healthcare enterprises. While inspired by foundational research like Cell2Sentence, NovaGenAI's models are purpose-built for commercial deployment using the full NVIDIA training stack (NeMo, NIM, CUDA, TensorRT).

What is the difference between Cell2Sentence and NovaGenAI?
Cell2Sentence is academic research demonstrating that biology can be encoded as language. NovaGenAI builds production-grade, enterprise-deployed AI models trained on proprietary biological data for specific commercial use cases, bridging the gap between academic proof-of-concept and real-world deployment in regulated industries.
