
Seedance 2.0 prompts: the complete writing guide
Master Seedance 2.0 prompts with the official three-part formula, multi-modal reference syntax (@image, @video, @audio), and mode-specific templates that ship.
Most "AI video prompt" articles you'll find online are written by people who type "cinematic shot of a wolf in the snow" and call it a tutorial. That's not what Seedance 2.0 prompts look like in practice. The model has a specific multi-modal grammar (@image1, @video2, @audio1, @character:<id>), and Volcengine, which owns the model, ships an official prompt guide structured around five distinct categories[1]. If your prompts don't match that structure, you'll burn credits re-rolling near-misses.
This is the long-form Seedance 2.0 prompts guide I wish I had when I started. It covers the official text formula, the reference syntax that controls multi-image and multi-video runs, mode-specific templates for each generation path (text-to-video, image-to-video, reference-to-video, and the video-editing patterns built on top of it), and the failure modes that send beginners back to the queue. Verified against ByteDance/Volcengine's API spec and the live seedance2.so studio configuration.
TL;DR
- Seedance 2.0 prompts follow a three-part formula: subject + action, then environment / lighting / style, then camera or audio cues[1].
- The model accepts up to 9 reference images, 3 reference videos, and 3 audio files in a single request[2]. Reference them inline as 图片1/视频2/音频1 (the Chinese form) or @image1/@video2/@audio1 (the seedance2.so shorthand)[3].
- Prompt length cap: ≤ 500 Chinese characters or ≤ 1,000 English words. Anything longer dilutes attention and the model starts ignoring details[2].
- Seedance 2.0 supports prompts in English, Chinese, Japanese, Indonesian, Spanish, and Portuguese; older Seedance variants only support English and Chinese[2].
- Upstream there are only three real modes: text-to-video, image-to-video (first frame or first+last frame), and multi-modal reference. "Video edit" and "video extend" are reference-to-video usage patterns, not separate models[2].
- Seedance 2.0 will refuse real human-face references; it expects either a generated portrait, a pre-authorized asset, or one of the platform-provided virtual avatars[2].
The three-part Seedance 2.0 prompt formula
Volcengine's official prompt guide lays out the structure as three composable blocks[1]. You don't need to fill every block every time, but stacking them in this order gives the model the cleanest signal.
Block 1: subject and action. Who is in the scene and what they are doing. This is the logical anchor. "A woman" tells the model nothing. "A tall woman in a long charcoal coat striding across a wet stone bridge" gives it a subject, a posture, and a movement vector.
Block 2: environment, lighting, style. Where it happens, what the light looks like, and the visual register. "At dusk, streetlights reflecting off rain-slick cobblestones, desaturated teal-and-amber color grade" is doing real work. Skip this block and the model defaults to a medium shot with neutral lighting and zero stylistic point of view.
Block 3: camera language and audio cues. How the camera moves and what you hear. "Slow dolly forward, shallow depth of field, ambient piano underscoring" turns a generic shot into a directed one. Seedance 2.0 generates native audio with lip-synced dialogue in 8+ languages, so audio cues belong in the prompt, not as an afterthought.
A clean three-block prompt:
A tall woman in a charcoal coat strides across a rain-slick stone bridge.
Dusk light, streetlights reflecting on cobblestones, desaturated teal-and-amber grade.
Slow dolly forward following the subject. Distant traffic and soft rain on stone.
That's three sentences and it covers all three blocks. The model has everything it needs.
Front-load the high-information words
Seedance 2.0 reads left to right with diminishing attention. The first sentence carries the most weight, the second is filled in around it, and anything after the third is "details to use if there's room." Put your hardest constraints (subject identity, key action, primary location) in the opening sentence. Stylistic flourishes go later.
This isn't a vibe. It maps to how the model balances prompt tokens against attention budget under the documented length caps (500 Chinese characters / 1,000 English words)[2]. Past the cap, prompts get aggressively summarized internally, and "summarized" usually means losing the specifics you cared about.
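As a quick illustration (both prompts are mine, not from the official guide), compare a back-loaded draft with its front-loaded rewrite:
Back-loaded: Moody, cinematic, soft grain, teal-and-amber grade, somewhere industrial at night, a security guard checks a flickering monitor.
Front-loaded: A security guard checks a flickering monitor in an empty industrial control room at night. Teal-and-amber grade, soft film grain. Slow push-in, low hum of machinery.
The constraint the model must not miss (the guard and the monitor) leads the second version; the style words trail behind it.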
Pick one style and commit
Mixing "Pixar 3D animation, gritty 35mm film grain, watercolor wash" inside a single prompt is the fastest way to get visual mush. The model has to reconcile three contradictory aesthetic signals and the result is usually a flat default. Pick one (say, Pixar 3D animation or gritty 35mm film, heavy grain or loose watercolor wash) and lean into it.
For text-to-video work in particular, style coherence is what separates "actually usable for a campaign" from "fun to look at once."
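Here is what committing looks like in practice (my own example, not an official one):
A delivery robot rolls through a flooded night market.
Gritty 35mm film, heavy grain, practical neon underlighting, desaturated teal-and-amber grade.
Handheld tracking shot at knee height. Rain on tarpaulins, distant vendor chatter.
Every descriptor pulls toward the same 35mm night-market register; nothing fights it.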
The reference syntax that nobody documents clearly
Here is the part most third-party guides get wrong. Seedance 2.0's reference-to-video mode (Volcengine's documentation calls it "multi-modal reference video generation") uses an explicit numeric pointer system in the prompt itself[3].
The official Volcengine syntax is Chinese square-bracket numbering: [图片1], [图片2], ..., [图片9] for images; [视频1], [视频2], [视频3] for videos[3]. On seedance2.so, the studio surfaces an English-friendly shorthand mapped to the same upstream contract: @image1 through @image9, @video1 through @video3, and @audio1 through @audio3[4]. They produce identical outputs; pick whichever reads cleaner to you.
The point: reference-to-video without explicit pointers is just a vague hint to the model. With pointers, you're telling it exactly which input slot maps to which idea in the prompt.
Multi-image references: the shopping-list pattern
Volcengine's recommended template for multi-image references[3]:
Reference @image1, @image2, @image3 (the camera), put it on a white desk.
Slowly orbit the camera, showing front, side, and back. White seamless backdrop.
The number-to-input mapping is positional. The first image you upload is @image1, the second is @image2, and so on. This is non-negotiable: there's no "name" field on uploads, just order. If you re-upload the same image second instead of first, your @image1 reference now points at a different image and the prompt breaks silently.
The official Volcengine guide contrasts a plain prompt with a structured one that pins three image inputs to specific roles (the boy, the corgi, and the lawn)[3]:
A boy wearing glasses and a blue T-shirt next to a corgi puppy, sitting on a lawn,
3D cartoon style.
versus the structured version:
[image 1] a boy wearing glasses and a blue T-shirt and [image 2] the corgi puppy,
sitting on [image 3] the lawn, 3D cartoon style.
Both work. The second yields measurably tighter adherence to the input images. If you care about commercial fidelity (product photography, character continuity across shots), use the explicit-pointer form every time.
Video references: action, camera, FX
The same pattern applies to video inputs[3]. Volcengine documents three distinct ways to use a reference video:
| What you want from the reference | Prompt template |
|---|---|
| Borrow the action (movement, choreography) | Reference the action in @video1, generate <new scene description>, keep action details consistent. |
| Borrow the camera move (dolly, orbit, push-in) | Reference the camera language in @video1, generate <new scene description>, keep the camera move consistent. |
| Borrow the VFX or particle effect | Reference the gold particle effect in @video1, apply the same effect to <subject in @image2>. |
This is genuinely a superpower if you're producing a series. Shoot one reference clip with the camera move you want (handheld push-in, smooth orbit, vertigo zoom) and reuse it across ten variations of subject and setting. You get visual continuity without re-prompting cinematography from scratch.
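A sketch of what one of those reuse prompts might look like (my wording, plugged into the official camera-reference template):
Reference the camera language in @video1 (slow orbit at eye level), generate a ceramic vase on a walnut table in soft window light, keep the camera move consistent.
Swap the vase for the next product, keep @video1 and the sentence structure, and the ten clips cut together like a single shoot.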
Audio references and beat-sync
Audio inputs work the same way: up to three audio files, referenced as @audio1, @audio2, @audio3[4]. The most common use is beat-sync video: pin the generated motion to a music track so cuts and movements land on the downbeat.
A working beat-sync prompt:
Reference the rhythm of @audio1. A skateboarder cuts through a Tokyo alley at night,
neon reflections in puddles. Camera tracks alongside at hip height. Each turn lands
on the kick drum. Dynamic editing, fast cuts on the beat.
The model isn't doing literal audio analysis on every drum hit, but it consistently produces motion that feels synced to the source audio when you tell it to.
Mode-by-mode prompt templates
Seedance 2.0 has three real upstream generation modes, plus several reference-to-video usage patterns that the API surfaces as distinct workflows[2]. Here's how prompts differ across them.
Text-to-video (T2V)
The simplest mode. Only your prompt drives the output. The full three-block formula carries the entire load. Aspect ratio (16:9, 9:16, 4:3, 3:4) and duration (5, 10, or 15 seconds) come from request parameters, not the prompt; don't waste tokens writing "in 16:9 format"[4].
Pattern:
<Subject + action, one sentence>.
<Environment + lighting + style, one sentence>.
<Camera move + audio cue, one sentence>.
Run it on seedance2.so/text-to-video when you don't have reference inputs.
Image-to-video (I2V), first-frame mode
You upload one image; it becomes the opening frame. Your prompt describes only the motion and continuation, not the subject, since the subject is already in the image. Re-describing what the image shows usually causes the model to "redraw" the subject and drift away from the source.
Pattern:
<Animation cue: how should the subject move?>
<Camera cue: how should the camera move?>
<Atmosphere cue: ambient sound, light shifts.>
Bad I2V prompt:
A blonde woman in a red dress walks through a market.
(The image already shows her. You're fighting the model.)
Good I2V prompt:
She turns slowly toward the camera and lifts her hand to brush hair from her face.
Slow dolly in. Distant market chatter, soft afternoon breeze.
Image-to-video (I2V), first+last-frame mode
Upload two images. The model interpolates between them and your prompt describes the transition path. This is the cleanest way to get a deterministic narrative arc in 5 seconds.
Pattern:
Transition from <description of first frame> to <description of last frame>.
<Movement style during transition: smooth, snappy, dreamy.>
<Camera cue.>
Note: the first and last images should be near-aspect-ratio-matched. The model auto-crops the second to align if they differ, but heavy cropping degrades the result[2].
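A worked example of the pattern (my own, assuming the first frame is a close-up at the start line and the last frame is a wide shot at the finish banner):
Transition from a close-up of the cyclist at the start line to a wide shot of her crossing the finish banner.
Smooth, continuous motion, no cuts.
Camera pulls back and rises into a crane shot as the transition completes.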
Reference-to-video / multi-modal reference (R2V)
This is Seedance 2.0's standout mode and the one that justifies most of this guide. You can mix images, videos, and audio in a single request (up to 9 + 3 + 3) and weave them into the prompt with the explicit pointers covered above[2].
The official template structure[3]:
Reference / extract / combine + [图片n / @imageN] of <referenced element>,
generate <full scene description>, keep <referenced element> consistent.
Example pulled from the official guide[3]:
The scene is set inside @image4 (the restaurant). The girl from @image1 is wearing
the outfit from @image2 and tidying items at the counter. The boy from @image3 is
a customer who walks up to ask for her contact. The logo from @image5 stays in the
bottom-right corner throughout.
Five image inputs, five explicit roles, one cohesive narrative. This kind of structured prompt is what enables reference-to-video at production quality. Without the pointer discipline, the model gets vague and the elements blur.
Video editing through R2V
Volcengine treats video editing (add / delete / modify elements) as an R2V usage pattern, not a separate mode[3]. Templates from the official guide:
| Operation | Template |
|---|---|
| Add element | In @video1, at <time/space position>, add <element description>. |
| Delete element | Delete <element> from @video1, keep everything else unchanged. |
| Replace element | Replace <original> in @video1 with <new>, keep motion and camera unchanged. |
The "keep motion and camera unchanged" tail is doing important work, without it, the model often regenerates the scene from scratch. Try it on video editing.
Video extension (forward/backward)
Same R2V mechanism. Two templates[3]:
Extend @video1 backward + <description of pre-segment>.
Extend @video1 forward + <description of post-segment>.
The model auto-clips the seam frames from your input; it does not regenerate the original, it only synthesizes the new tail or head. Submit your extension intent on video extension.
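A forward extension might read like this (illustrative wording plugged into the template above):
Extend @video1 forward: the drone keeps climbing past the rooftop edge, revealing the full skyline at sunset, then holds on the horizon for the final second.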
Track stitching (3-clip composition)
If you upload multiple videos for stitching, the constraint is hard: maximum 3 video inputs, total duration ≤ 15 seconds[3].
Template:
@video1 + <transition description> + connects to @video2 + <transition description>
+ connects to @video3.
Worked example from the official guide[3]:
@video1, the moment a leaf hits the ground, gold particles burst, a gust of wind
blows through, connects to @video2.
The model invents only the transition frames; the source clips stay intact.
Camera language Seedance 2.0 actually understands
The model was trained on cinematography descriptions, so professional shot vocabulary outperforms casual language. The terms below are the ones I've seen produce reliable output, drawn from production runs across seedance2.so and cross-checked against Volcengine's reference examples[3].
Movement:
- slow dolly forward (physical camera moving toward subject) beats zoom in (lens adjustment) every time
- tracking shot following subject from left to right
- orbiting around subject at eye level
- crane shot ascending over <location>
- steady push-in toward <subject>
- handheld, slight shake (for documentary feel)
- whip pan to <new subject> (for snappy transitions)
Angle:
- low angle looking up at subject (makes subjects look powerful)
- overhead establishing shot (for spatial relationships)
- dutch tilt (for unease)
- extreme close-up on hands (directs attention to detail)
- eye-level medium shot (for neutral conversation framing)
Lens:
- shallow depth of field, subject in focus, background blurred
- rack focus from foreground object to subject
- anamorphic lens flare
- wide-angle distortion at the edges
The pattern: use the words a working cinematographer would use. "Cinematic" is too vague; "anamorphic 2.39:1, lens flare on highlights, shallow DoF at f/1.8" is something the model can act on.
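Strung together, a camera-forward prompt using that vocabulary might read (my own example, not from the official guide):
A violinist plays alone on an empty subway platform at night.
Harsh fluorescent key light, cool green-tinted grade.
Slow dolly forward to an extreme close-up on her hands, then rack focus to the empty tracks behind her. Room reverb on the violin, a distant train rumble.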
Style and lighting descriptors that actually move the needle
Style is where Block 2 of the formula earns its keep. A few categories worth memorizing.
Lighting: golden hour, blue hour, harsh midday sun, soft window light, single key light from screen-left, practical neon underlighting, silhouette against sunset, volumetric god rays through fog.
Color: desaturated teal-and-amber grade, high-contrast monochrome, pastel washed-out palette, saturated tropical color, cool moonlit blues, warm tungsten interiors.
Stock / format: 35mm film, fine grain; 16mm film, heavy grain; digital cinema, clean; VHS, scan lines, color bleed; super-8 home movie; polaroid faded edges.
Genre: Wes Anderson symmetry, pastel; David Fincher cool palette, low-key; Studio Ghibli watercolor backgrounds; '80s sci-fi, neon and chrome; noir, deep shadows, venetian blind patterns.
The closer your descriptor is to a real cinematographic or production reference, the better the result. "Cinematic and dramatic" tells the model nothing. "Roger Deakins golden hour, low contrast, subtle haze" tells it a lot.
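One way to cash that out in a full prompt (my own example, borrowing the descriptors above):
A fisherman mends a net on a weathered wooden dock at dawn.
Roger Deakins golden hour, low contrast, subtle haze, 35mm film with fine grain.
Static wide shot. Water lapping against the pilings, gulls in the distance.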
Common failure modes and how to fix them
After reviewing hundreds of generations across seedance2.so and reading user reports, the same five failure modes account for most "this looks bad" feedback. Here's the diagnostic lookup.
"The output ignored half of my prompt"
Almost always a length problem. Your prompt is probably over the cap (500 Chinese characters / 1,000 English words)[2], or you stuffed too many ideas into one shot. Rule of thumb: 1–2 subjects per prompt, 2–4 sentences total. If you need three subjects doing three different things in three locations, that's three separate generations stitched in post, not one prompt.
"The reference image got drawn over"
In I2V mode, you described what was in the image instead of what should happen next. Re-write the prompt to describe motion only, not subject. In R2V mode, you forgot the explicit @imageN pointer, so the model treated the upload as a vague aesthetic hint instead of a hard constraint.
"It refuses to generate with my reference photo"
Seedance 2.0 explicitly does not accept real human-face references; uploads with detectable real human faces are rejected at the safety layer[2]. Three workarounds: use a Seedream-generated portrait of a fictional person as your reference, use one of Volcengine's pre-set virtual avatars, or supply documented authorization for the real person depicted. There is no "turn off this filter" toggle.
"The motion is jittery / the subject morphs"
You probably went too long. Generate at 5 seconds first to verify the prompt holds together, then commit to 10 or 15 seconds. Quality at 15s is meaningfully different from quality at 5s, not because the model is worse, but because more is happening, and any prompt ambiguity gets amplified across 25–35 frames per second of additional content.
"Audio is out of sync with the visuals"
Either you didn't reference the audio explicitly with @audio1, or your prompt described visual rhythm that contradicts the actual audio. If the audio is a 110 BPM track and your prompt says "slow contemplative pacing," the model has to pick one. Tell it explicitly: "match cuts to the kick drum of @audio1" is unambiguous.
Iteration workflow that doesn't burn credits
Generating a 10-second high-quality Seedance 2.0 video runs around 7 credits per second on the standard tier, about 70 credits per generation, or roughly $2.80 at the entry-tier credit rate[5]. Wasted runs add up. The workflow that minimizes waste:
- Draft on the fast/basic tier first. Same prompt, same parameters, lower credit cost. If the composition is wrong on basic, it'll be wrong on high too; fix it before paying for high. See pricing for current tier rates.
- Generate at 5 seconds first, even if you ultimately want 15. A 5-second test costs a third of a 15-second run. If the prompt holds at 5, scale up.
- One variable at a time. Don't change the subject, the camera, and the style in a single re-roll. You won't know which change moved the needle.
- Save your seed images. When a Seedream-generated portrait works as a reference, keep that exact image; re-running the same R2V prompt with the same reference is the closest thing to a deterministic re-roll.
- Use the prompt-enhancement toggle when starting from a sparse idea. The studio's web-search-enhanced mode rewrites your prompt with retrieved context before sending it to the model[4]. Useful for queries like "what does an authentic Seoul jjajangmyeon shop interior look like at 11pm on a weekday", since the model now has retrieved context to draw from.
Multi-language prompts and when to switch
Seedance 2.0 was trained on a multilingual corpus and supports prompts in English, Chinese, Japanese, Indonesian, Spanish, and Portuguese[2]. The older Seedance variants (1.5 Pro, 1.0 Pro) only support English and Chinese. This matters in two scenarios:
- Localized dialogue. If the generated video needs Spanish-speaking characters or Korean subtitles, write the dialogue in the target language directly. Don't write English and ask the model to "have them speak Spanish"; it works, but quality is worse than just writing the line in Spanish (a worked example follows below).
- Cultural specificity. A prompt like "a typical Mexican breakfast on a wooden table" written in Spanish (un desayuno mexicano típico sobre una mesa de madera) frequently produces more culturally accurate output than the English equivalent. The training data weighting differs.
For everything else, English is the default and works fine. Chinese prompts are slightly more concise per token (≤ 500 characters versus ≤ 1,000 English words) but produce equivalent output.
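To make the localized-dialogue point concrete, here is a sketch with the spoken line written directly in Spanish (my own example, not from the official guide):
A street vendor at an Oaxaca market hands a customer a paper-wrapped tlayuda and says: "Aquí tiene, recién hecha. ¡Que la disfrute!"
Warm string lights, handheld medium shot, market chatter and a sizzling comal in the background.
The dialogue renders in Spanish because it was written in Spanish; nothing in the prompt asks the model to translate.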
FAQ
How long should a Seedance 2.0 prompt be?
Aim for 2–4 sentences, roughly 60–200 English words. The hard cap is 1,000 English words / 500 Chinese characters[2], but you'll hit diminishing returns long before that. Past ~250 words the model starts compressing your prompt internally and you lose specifics.
Does Seedance 2.0 support negative prompts?
Not as a dedicated parameter. There is no "negative_prompt" field in the API contract[2]. You can add constraints inline ("no on-screen text, no logos, no people in the background") and the model honors them with reasonable consistency. It's not as deterministic as a true negative-prompt slot in image models like Stable Diffusion, but it works.
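In a full prompt, the constraint usually works best as its own closing sentence (an illustrative example, not an official one):
A barista pours latte art in a sunlit cafe. Soft window light, warm grade, locked-off medium shot. No on-screen text, no logos, no other people in frame.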
Can I reference 9 images and 3 videos and 3 audio files in the same prompt?
Yes, that's the maximum multi-modal R2V load: up to 9 images, 3 videos, 3 audio inputs in a single request[2]. The API enforces these caps. Practically, prompts with that many references are very hard to keep coherent; most production R2V work uses 2–5 image references and at most one video or audio reference.
Why does my generation fail with "real face not allowed"?
Seedance 2.0 refuses references containing detectable real human faces[2]. Use a fictional generated portrait, a pre-authorized virtual avatar, or upload an explicit authorization for the depicted real person. The check runs upstream at the model level; there's no platform-level override.
What's the difference between Seedance 2.0 and Seedance 2.0 Fast for prompts?
Same prompt grammar, same reference syntax, same length caps. Fast is the lower-cost basic-quality tier; Preview is high-quality. A prompt that works on Fast will work identically on Preview, just at higher visual fidelity and roughly 1.7× the credit cost on most providers[5]. Iterate on Fast, finalize on Preview.
Can I write prompts in Chinese for English-language output, or vice versa?
Yes. Prompt language and output language are independent. Write in whichever language you think most clearly in; the model handles the cross-language translation internally. The exception is on-screen text and dialogue: those will appear in the language you wrote them in.
Does prompt order within a sentence matter?
Yes, materially. Earlier tokens get more attention budget. Lead with the hardest constraints (subject identity, primary action, key location) and let stylistic flourishes follow. "A red sports car at sunset, cinematic" prompts the model to optimize for "red sports car"; "Cinematic shot of a red sports car at sunset" weights "cinematic shot" first and the car becomes secondary.
Is there an official Seedance 2.0 prompt library?
Volcengine ships an official prompt guide with worked examples for slogans, subtitles, bubble dialogue, multi-image references, action references, camera-move references, VFX references, and video editing[1][3]. It's the canonical source. The studio at seedance2.so/text-to-video maps the same patterns to a UI; if you can express the prompt structure in either, you can use the other.
Prompts that ship: the recap
Writing Seedance 2.0 prompts well comes down to three habits. First, follow the three-block formula (subject and action, then environment and style, then camera and audio cues) and front-load your hardest constraints in the opening sentence. Second, use the explicit reference syntax (@image1 through @image9, @video1 through @video3, @audio1 through @audio3) every single time you have multi-modal inputs; the difference between a vague reference and a pointered reference is the difference between "kind of works" and "ships." Third, respect the constraints the model documents (2–4 sentence prompts, 1–2 subjects, no real human faces, length under 1,000 English words) and iterate cheap on the Fast tier before committing credits to Preview. Do those three things and your Seedance 2.0 prompts will produce ship-quality output the first or second roll, not the fifth or sixth.
References
- Volcengine ArkClaw. Doubao Seedance 2.0 series prompt guide (Chinese-language documentation), Section 1: general guidelines. Retrieved May 2026 from volcengine.com/docs/82379/2222480
- Volcengine ArkClaw. Create video generation task API (Chinese-language documentation), Seedance 2.0 model capability spec, prompt language and length, input limits. Retrieved May 2026 from volcengine.com/docs/82379/1520757
- Volcengine ArkClaw. Doubao Seedance 2.0 series prompt guide (Chinese-language documentation), Sections 3–5: image / video reference and editing templates. Retrieved May 2026 from volcengine.com/docs/82379/2222480
- Seedance2.so. Studio reference syntax and parameter helpText for omni-reference generation. Retrieved May 2026 from seedance2.so/reference-to-video
- Seedance2.so. Pricing and credit-per-second rates by tier. Retrieved May 2026 from seedance2.so/pricing
Further reading
- BytePlus ModelArk. Product updates, Dreamina Seedance 2.0 API release. docs.byteplus.com/en/docs/ModelArk
- ByteDance Seed. Seedance technical report and benchmark results. seed.bytedance.com/seedance