The Challenge: Accuracy Over Artistry
Text-to-image generation has become a familiar capability — models like Stable Diffusion and DALL·E can produce beautiful, creative visuals from natural language descriptions. But scientific illustration is an entirely different problem. When a researcher describes a lab protocol or a biological mechanism, the resulting figure must be correct, not merely aesthetic.
Accuracy in scientific figures operates on two distinct layers:
① Scientific Accuracy
The illustration must faithfully represent the science — the correct sequence of events in a protocol, the right components in a signaling pathway, the proper flow of a biological process. Getting this wrong doesn't just look bad; it communicates false information.
② Style & Composition Accuracy
Objects must appear in visually correct relationships. A mouse model shouldn't be rendered purple. A beaker must sit on top of a hot plate, not beneath it. These spatial and stylistic conventions are deeply encoded in scientific communication.
Pixel-level generative models fail on both fronts — they optimize for visual plausibility, not semantic correctness. This demands an entirely different approach.
The Solution: Two-Stage Structured Generation
Rather than generating figures end-to-end, the system breaks the problem into two distinct stages, each solving one layer of the accuracy problem.
Stage 1: Composable Layout Patterns
The core insight of Stage 1 is that scientific figures are not arbitrary — they follow conventions. A mouse is always depicted in a certain orientation. A centrifuge tube always sits upright. Objects in a protocol flow in a consistent direction. These conventions can be pre-encoded as a library of object–object interaction maps.
When the system's LLM processes a description, it doesn't generate a layout from scratch. Instead, it matches the described elements to existing patterns in this library, then composes them together.
Think of building with Lego: you don't redesign every brick for each new build. Each brick is a reusable, well-defined unit, and complex structures emerge from snapping the right pieces together in the right order. Our layout library works the same way: it stores small, verified compositional patterns (e.g., "tube on hot plate," "arrow from mouse to sample") that can be combined and chained to represent complex experimental workflows. Because every primitive is already verified, the composed layout stays correct as long as the combinations are.
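To make the composition idea concrete, here is a minimal sketch in Python. It is not the production implementation: the names `Box`, `SIZES`, `PATTERNS`, `place_on`, `place_right_of`, and `compose` are all hypothetical, as are the object sizes and the 40-unit gap. It assumes each library pattern can be reduced to a rule that positions one object relative to an already-placed anchor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x: float  # left edge
    y: float  # top edge (y grows downward, as in screen coordinates)
    w: float
    h: float


# Hypothetical default sizes for each object type (width, height).
SIZES = {"mouse": (100, 60), "sample": (40, 40), "hot_plate": (120, 40), "tube": (30, 80)}


def place_on(anchor: Box, w: float, h: float) -> Box:
    """Centre the object horizontally, resting directly on top of the anchor."""
    return Box(anchor.x + (anchor.w - w) / 2, anchor.y - h, w, h)


def place_right_of(anchor: Box, w: float, h: float, gap: float = 40) -> Box:
    """Put the object to the right of the anchor, vertically centred."""
    return Box(anchor.x + anchor.w + gap, anchor.y + (anchor.h - h) / 2, w, h)


# Hypothetical pattern library: (object, anchor) -> verified placement rule.
PATTERNS = {
    ("sample", "mouse"): place_right_of,
    ("hot_plate", "sample"): place_right_of,
    ("tube", "hot_plate"): place_on,
}


def compose(root: str, pairs: list[tuple[str, str]]) -> dict[str, Box]:
    """Chain library patterns into one layout, starting from a root object."""
    layout = {root: Box(0, 0, *SIZES[root])}
    for obj, anchor in pairs:
        rule = PATTERNS.get((obj, anchor))
        if rule is None:
            # No verified pattern exists: refuse rather than guess a layout.
            raise KeyError(f"no layout pattern for {obj!r} relative to {anchor!r}")
        layout[obj] = rule(layout[anchor], *SIZES[obj])
    return layout


layout = compose(
    "mouse",
    [("sample", "mouse"), ("hot_plate", "sample"), ("tube", "hot_plate")],
)
for name, box in layout.items():
    print(name, box)
```

In this sketch the tube's bottom edge always coincides with the hot plate's top edge, because that relationship lives in the pattern rather than being re-derived for each figure. An unknown pairing raises an error instead of producing an unverified layout.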
Stage 2: Reversed Detection for Style Consistency
Traditional computer vision detection works in one direction: given an image, find and localize objects within it (image → bounding boxes). Stage 2 inverts this: given a set of bounding boxes from the layout map, retrieve the correct pre-drawn icon to place in each slot (bounding box → icon).
This "reversed detection" approach sidesteps the fundamental problem with pixel-level generation: stylistic inconsistency. When different objects are generated independently, they rarely look like they belong in the same illustration: line weights differ, color palettes clash, perspective varies. By drawing from a unified library of pre-rendered assets, every element in the output figure shares the same visual language by construction.
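The bounding box → icon step can be sketched as a lookup followed by a fit. The Python below is illustrative only: `Icon`, `Placement`, `ICON_LIBRARY`, `fill_slots`, the style tag, and the icon dimensions are all assumptions, and a real retrieval step would likely be richer than an exact label match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Icon:
    name: str
    style: str     # every library asset carries the same style tag
    width: float   # native size of the pre-drawn asset
    height: float


@dataclass(frozen=True)
class Placement:
    icon: Icon
    x: float
    y: float
    scale: float


# Hypothetical unified asset library, keyed by object label.
ICON_LIBRARY = {
    "mouse": Icon("mouse", "flat-v1", 200, 120),
    "tube": Icon("tube", "flat-v1", 60, 200),
    "hot_plate": Icon("hot_plate", "flat-v1", 240, 80),
}

Slot = tuple[str, tuple[float, float, float, float]]  # (label, (x, y, w, h))


def fill_slots(slots: list[Slot]) -> list[Placement]:
    """For each labelled bounding box, retrieve a library icon and fit it inside."""
    placements = []
    for label, (x, y, w, h) in slots:
        icon = ICON_LIBRARY[label]  # raises KeyError if the library has no such asset
        # Scale uniformly so the icon fits the slot without distortion.
        scale = min(w / icon.width, h / icon.height)
        # Centre the scaled icon inside its slot.
        px = x + (w - icon.width * scale) / 2
        py = y + (h - icon.height * scale) / 2
        placements.append(Placement(icon, px, py, scale))
    return placements


placements = fill_slots([
    ("mouse", (0, 0, 100, 60)),
    ("hot_plate", (220, 10, 120, 40)),
    ("tube", (265, -70, 30, 80)),
])
for p in placements:
    print(p.icon.name, p.icon.style, round(p.x, 1), round(p.y, 1), round(p.scale, 2))
```

Nothing in this sketch generates pixels. Every placed element comes from the same library, so a shared style follows from the lookup itself rather than from any post-hoc harmonization.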
The result is a system that produces figures that are both scientifically grounded (via Stage 1's layout constraints) and visually coherent (via Stage 2's library-driven fill) — properties that pixel-wise generation simply cannot guarantee.
Impact
This system was shipped to 100% of the BioRender user base, reaching researchers, educators, and science communicators who rely on accurate visual communication. It represents a principled alternative to generative image models for domains where accuracy is non-negotiable, demonstrating how structured, multi-stage AI pipelines can outperform end-to-end generation for specialized professional tasks.