Text-to-Figure for Scientific Illustration

The Challenge: Accuracy Over Artistry

Text-to-image generation has become a familiar capability — models like Stable Diffusion and DALL·E can produce beautiful, creative visuals from natural language descriptions. But scientific illustration is an entirely different problem. When a researcher describes a lab protocol or a biological mechanism, the resulting figure must be correct, not merely aesthetic.

Accuracy in scientific figures operates on two distinct layers:

① Scientific Accuracy

The illustration must faithfully represent the science — the correct sequence of events in a protocol, the right components in a signaling pathway, the proper flow of a biological process. Getting this wrong doesn't just look bad; it communicates false information.

② Style & Composition Accuracy

Objects must appear in visually correct relationships. A mouse model shouldn't be rendered purple. A beaker must sit on top of a hot plate, not beneath it. These spatial and stylistic conventions are deeply encoded in scientific communication.

Pixel-level generative models fail on both fronts — they optimize for visual plausibility, not semantic correctness. This demands an entirely different approach.

The Solution: Two-Stage Structured Generation

Rather than generating figures end-to-end, the system breaks the problem into two distinct stages, each solving one layer of the accuracy problem.

Stage 1: LLM Layout Matching

An LLM reads the natural language description and maps it to a set of pre-stored object–object interaction patterns. These patterns encode spatial relationships — what sits above what, what flows into what, which objects typically appear together — preserving both scientific and compositional conventions.


Stage 2: Icon Fill via Reversed Detection

The layout map's bounding boxes are filled with pre-drawn, style-consistent icons retrieved from a curated asset library. No pixels are generated — instead, a reversed detection approach maps each layout slot to the correct existing icon, guaranteeing style uniformity across the entire figure.

Stage 1: Composable Layout Patterns

The core insight of Stage 1 is that scientific figures are not arbitrary: they follow conventions. A mouse is conventionally depicted in a consistent orientation. A centrifuge tube sits upright. Objects in a protocol flow in a consistent direction. These conventions can be pre-encoded as a library of object–object interaction maps.

When the LLM processes a description, it doesn't generate a layout from scratch — it matches the described elements to existing patterns in this library, then composes them together.
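The matching step can be sketched as a lookup against a pattern library. This is a minimal illustration, not the production implementation: it assumes the LLM has already turned the description into (subject, relation, object) triples, and the `Pattern` schema, the relation names, and the library contents are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    """One object-object interaction pattern (hypothetical schema)."""
    subject: str
    relation: str  # e.g. "on_top_of", "flows_into" (illustrative names)
    obj: str


# Hypothetical library of pre-verified patterns.
PATTERN_LIBRARY = {
    Pattern("tube", "on_top_of", "hot_plate"),
    Pattern("beaker", "on_top_of", "hot_plate"),
    Pattern("mouse", "flows_into", "sample"),
}


def match_patterns(triples):
    """Split extracted triples into library matches and unknown patterns."""
    matched, unmatched = [], []
    for subject, relation, obj in triples:
        candidate = Pattern(subject, relation, obj)
        if candidate in PATTERN_LIBRARY:
            matched.append(candidate)
        else:
            unmatched.append(candidate)
    return matched, unmatched


# Stand-in for the LLM's structured output (assumed format).
triples = [
    ("beaker", "on_top_of", "hot_plate"),
    ("hot_plate", "on_top_of", "beaker"),  # violates the convention
]
matched, unmatched = match_patterns(triples)
```

Because only patterns present in the library are accepted, an arrangement such as a hot plate sitting on a beaker ends up in `unmatched` instead of reaching the figure.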

The Lego Analogy

Think of building with Lego: you don't redesign every brick from scratch for each build. Each brick is a reusable, well-defined unit, and complex structures emerge from snapping the right pieces together in the right order. Our layout library works the same way: it stores small, verified compositional patterns (e.g., "tube on hot plate," "arrow from mouse to sample") that can be combined and chained to represent complex experimental workflows. Complexity emerges from correct combinations of correct primitives.
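One way to picture this chaining is to treat each pattern as a placement rule that positions a new bounding box relative to one already in the layout. The sketch below is an assumption-laden illustration: the `Box` type, the two placement rules, the gap value, and the object sizes are all hypothetical, and the coordinate system assumes y grows downward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x: float  # left edge
    y: float  # top edge (y grows downward)
    w: float
    h: float


def place_on_top(base: Box, w: float, h: float) -> Box:
    """Center a w-by-h box directly above base, resting on its top edge."""
    return Box(base.x + (base.w - w) / 2, base.y - h, w, h)


def place_right_of(anchor: Box, w: float, h: float, gap: float = 20) -> Box:
    """Place a w-by-h box to the right of anchor, vertically centered on it."""
    return Box(anchor.x + anchor.w + gap, anchor.y + (anchor.h - h) / 2, w, h)


# Hypothetical primitives: each relation name maps to one placement rule.
RELATIONS = {"on_top_of": place_on_top, "right_of": place_right_of}


def compose(root_name, root_box, steps, sizes):
    """Chain (new_object, relation, existing_object) steps into a layout map."""
    layout = {root_name: root_box}
    for new_obj, relation, existing in steps:
        w, h = sizes[new_obj]
        layout[new_obj] = RELATIONS[relation](layout[existing], w, h)
    return layout


sizes = {"tube": (20, 60), "sample": (40, 40)}
layout = compose(
    "hot_plate",
    Box(0, 100, 100, 30),
    [("tube", "on_top_of", "hot_plate"), ("sample", "right_of", "hot_plate")],
    sizes,
)
```

Each step only needs a box that an earlier step has already placed, so longer workflows are built by appending more steps and no new geometry code is needed.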

Stage 2: Reversed Detection for Style Consistency

Traditional computer vision detection works in one direction: given an image, find and localize objects within it (image → bounding boxes). Stage 2 inverts this: given a set of bounding boxes from the layout map, retrieve the correct pre-drawn icon to place in each slot (bounding box → icon).

This "reversed detection" approach sidesteps the fundamental problem with pixel-level generation: stylistic inconsistency. When different objects are generated independently, they rarely look like they belong in the same illustration: line weights differ, color palettes clash, and perspective varies. By drawing from a unified library of pre-rendered assets, every element in the output figure shares the same visual language by construction.
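The slot-to-icon direction can be sketched as a library lookup followed by a fit-to-box scale. This is a simplified illustration under stated assumptions: the `Slot` and `Icon` types, the asset paths, and the aspect-preserving centering policy are hypothetical, and the real retrieval step may be more sophisticated than a label lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    """A labeled bounding box produced by the layout stage (assumed shape)."""
    label: str
    x: float
    y: float
    w: float
    h: float


@dataclass(frozen=True)
class Icon:
    """A pre-drawn asset with its native size (hypothetical schema)."""
    asset_id: str
    w: float
    h: float


# Hypothetical curated asset library; the paths are placeholders.
ICON_LIBRARY = {
    "mouse": Icon("icons/mouse.svg", 200, 100),
    "tube": Icon("icons/tube.svg", 50, 150),
}


def fill_slot(slot: Slot) -> dict:
    """Map a layout slot to an existing icon, scaled to fit and centered."""
    icon = ICON_LIBRARY[slot.label]  # raises KeyError if no asset exists
    scale = min(slot.w / icon.w, slot.h / icon.h)  # preserve aspect ratio
    draw_w, draw_h = icon.w * scale, icon.h * scale
    return {
        "asset_id": icon.asset_id,
        "x": slot.x + (slot.w - draw_w) / 2,
        "y": slot.y + (slot.h - draw_h) / 2,
        "w": draw_w,
        "h": draw_h,
    }


placement = fill_slot(Slot("mouse", 0, 0, 100, 100))
```

Nothing is drawn at this stage: the output only says which existing asset goes where, which is why every element inherits the library's style.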

The result is a system that produces figures that are both scientifically grounded (via Stage 1's layout constraints) and visually coherent (via Stage 2's library-driven fill) — properties that pixel-wise generation simply cannot guarantee.

Impact

This system was shipped to 100% of the BioRender user base, reaching researchers, educators, and science communicators who rely on accurate visual communication. It represents a principled alternative to generative image models for domains where accuracy is non-negotiable, demonstrating how structured, multi-stage AI pipelines can outperform end-to-end generation for specialized professional tasks.
