Research Note | Speech Generation | Instruction Following

Bluebell: Bringing Instruction Following to Speech Generation

At Breeze Blue, we are rethinking what the future of audio intelligence should sound like.

Previous voice technologies excel at copying. But the next generation of Voice AI should do more: it should understand nuance, adapt across contexts, and remain controllable.

Bluebell is our first step in that direction — a generative speech model that brings state-of-the-art instruction following to speech generation.

01

Motivation

Speech generation has been optimized for realism. We believe it now needs to be optimized for intent.

The field has made enormous progress in naturalness, speaker consistency, and voice cloning. But high-fidelity reproduction is not the same as controllable generation. A voice that sounds perfect is not always a voice that feels right, fits the context, or follows direction.

Other generative domains have already moved in this direction. In language and visual generation, instruction following is now a core capability. Users expect models to respond to prompts with precision, flexibility, and range. Speech generation should be no different.

With Bluebell, we treat instruction following as a first-class capability for speech. The goal is not just to generate natural audio, but to generate speech that can be shaped by creative intent: a voice defined by persona, a performance shaped by context, and a delivery that changes with the scene.

02

Bluebell — A New Kind of Speech Model

At its core, Bluebell is an audio language model built on top of a pre-trained large language model. It is further trained on interleaved text-and-audio sequences, allowing text and audio to be modeled in a single stream. In this formulation, text provides instructions and script, while audio serves both as a conditioning signal and as a generation target.
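
To make the single-stream formulation concrete, here is a minimal sketch of how text and audio segments might be flattened into one token sequence. It is an illustration under stated assumptions, not Bluebell's actual data format: the `Segment` type, the `BOA`/`EOA` boundary markers, the token ids, and the loss mask are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical boundary markers for audio spans (not from the source).
BOA, EOA = -1, -2


@dataclass
class Segment:
    modality: str            # "text" or "audio"
    tokens: List[int]        # already-tokenized content
    is_target: bool = False  # True if the model should learn to generate it


def interleave(segments: List[Segment]) -> Tuple[List[int], List[bool]]:
    """Flatten segments into one stream plus a per-token training mask."""
    stream: List[int] = []
    loss_mask: List[bool] = []
    for seg in segments:
        if seg.modality == "audio":
            toks = [BOA, *seg.tokens, EOA]  # wrap audio in boundary markers
        else:
            toks = list(seg.tokens)
        stream.extend(toks)
        loss_mask.extend([seg.is_target] * len(toks))
    return stream, loss_mask


# Text carries the instruction and script; audio is both condition and target.
instruction = Segment("text", [11, 12])
reference = Segment("audio", [501, 502])                   # conditioning audio
script = Segment("text", [21])
output = Segment("audio", [601, 602, 603], is_target=True)  # generation target

stream, mask = interleave([instruction, reference, script, output])
```

In this sketch the same audio wrapping serves both roles; only the mask distinguishes a reference clip from the speech the model is asked to produce.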

This gives Bluebell strong instruction-following ability in two settings: creating a new voice from a text prompt alone, and directing how an existing voice, supplied as a speech reference, should perform.

2.1

To Design: Create a new voice from a text prompt

Bluebell combines world knowledge with a strong understanding of natural language. It can follow open-ended text prompts — whether they describe the persona, tone, scenario, or style — and generate a voice that matches the intended character.

We evaluate Bluebell on InstructTTSEval, a benchmark for measuring complex natural-language style control in speech generation. InstructTTSEval covers three types of instruction: Acoustic-Parameter Specification (APS), Descriptive-Style Directive (DSD), and Role-Play (RP), each with 1K test cases. Bluebell achieves state-of-the-art results on this benchmark, demonstrating a strong ability to understand complex instructions and generate diverse voices. Unless otherwise stated, Gemini-3.1-pro is used as the evaluation judge. In our internal human evaluation, its scoring was less biased than that of Gemini-2.5-pro.
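
As a rough sketch of how per-category results could be tallied, the snippet below aggregates judge verdicts into one score per instruction type. It assumes the judge returns a binary pass/fail for each test case; the `accuracy_by_category` helper and the sample verdicts are hypothetical and are not benchmark results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_category(verdicts: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Fraction of test cases the judge accepted, per instruction category."""
    totals: Dict[str, int] = defaultdict(int)
    passed: Dict[str, int] = defaultdict(int)
    for category, ok in verdicts:
        totals[category] += 1
        passed[category] += int(ok)
    return {c: passed[c] / totals[c] for c in totals}


# Made-up verdicts for illustration only.
sample = [("APS", True), ("APS", False), ("DSD", True), ("RP", True), ("RP", True)]
scores = accuracy_by_category(sample)
```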

2.2

To Direct: Instruct an existing voice to perform in a new speaking style

Voice direction emerges naturally from training on long-horizon audio data. Bluebell can follow natural-language instructions to alter the speaking style of a reference speech clip, while preserving the underlying speaker identity. Importantly, guidance strength can be adjusted to trade off between speaker consistency and instruction following.
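
The text does not say how guidance is implemented. One common way to expose such a knob, assumed here purely for illustration, is classifier-free-guidance-style extrapolation between two sets of next-token logits: one computed with the instruction in context and one without it. The function name and the numbers below are hypothetical.

```python
from typing import List


def guided_logits(
    with_instruction: List[float],
    without_instruction: List[float],
    strength: float,
) -> List[float]:
    """Blend two logit vectors; larger strength leans toward the instruction.

    strength = 0 reproduces the reference-only prediction, strength = 1 the
    instructed prediction, and strength > 1 extrapolates past it.
    """
    return [
        u + strength * (c - u)
        for c, u in zip(with_instruction, without_instruction)
    ]


cond = [2.0, 0.0]    # hypothetical logits with the style instruction
uncond = [1.0, 1.0]  # hypothetical logits from the reference voice alone

weak = guided_logits(cond, uncond, 0.0)
strong = guided_logits(cond, uncond, 2.0)
```

Under this assumption, a low strength stays close to the reference speaker, while a high strength favors the instructed style at some cost to speaker consistency.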

We evaluate this setting on InstructTTSEval. For each of the 3K test cases, we randomly select one of the 30 preset Gemini TTS voices as the one-shot reference for Bluebell. We vary the guidance strength at test time to measure the trade-off between instruction following and speaker consistency. Bluebell achieves a state-of-the-art Pareto frontier in voice direction, with a clear scaling trend: at the same level of instruction following, larger models preserve speaker consistency better.
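
For readers unfamiliar with the term, the sketch below shows how a Pareto frontier can be extracted from a guidance-strength sweep, where each run yields an (instruction following, speaker consistency) pair and higher is better on both axes. The data points are invented for illustration and are not Bluebell results.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (instruction_following, speaker_consistency)


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points that no other point beats on both metrics."""
    # Sort by the first metric, best first; break ties by the second metric.
    ordered = sorted(set(points), key=lambda p: (-p[0], -p[1]))
    frontier: List[Point] = []
    best_second = float("-inf")
    for p in ordered:
        # Every earlier point is at least as good on the first metric, so p
        # survives only if it strictly improves on the second metric.
        if p[1] > best_second:
            frontier.append(p)
            best_second = p[1]
    return frontier


# Hypothetical sweep over guidance strengths.
runs = [(0.60, 0.90), (0.70, 0.85), (0.65, 0.80), (0.80, 0.70), (0.75, 0.65)]
frontier = pareto_frontier(runs)
```

Comparing two models then amounts to checking whether one frontier lies above the other: at equal instruction following, the better model retains higher speaker consistency.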

03

Try Bluebell

Speech generation is moving beyond reproduction toward creation, direction, and interaction. With Bluebell, we are taking an early step toward speech models that can follow intent, adapt to context, and unlock a wider range of voice experiences.

This is just the beginning. Our mission is to create the next generation of audio-native experiences through research and products. We're excited to share Bluebell with creators, developers, and teams exploring what those experiences could become.