Gemini Omni: What Google Actually Shipped at I/O 2026
2026/05/20

Gemini Omni replaces Veo in the Gemini app with native multimodal video generation, 10-second clips, and conversational editing. Here's what Google shipped.

On the I/O 2026 stage, Google did something it almost never does. It killed one of its own product lines mid-flight. Buried inside the Gemini Omni announcement, in the small print under the new video model's marketing page, sits the actual headline: "Gemini Omni will replace Veo in the Gemini app."[1] Veo 3.1 is over inside Gemini. The brand that defined Google's video-generation push for two years is being retired in favor of a single model that handles text, image, audio, and video in one architecture.

I spent the first 12 hours after the keynote reading the official model card, watching the demos, and tracking what serious builders on X actually got out of it. This article is what stuck: the eight capability axes that separate Gemini Omni Flash from Veo, the 10-second cap and what it really gates, the audio-editing feature Google chose not to ship, and the comparison builders ran in the first 24 hours. By the end you'll know what Omni is, what it isn't, and where Google left the door open.

TL;DR

  • Gemini Omni Flash is a native multimodal transformer that takes text, image, audio, and video as input and produces video with audio as output.[2]
  • It replaces Veo 3.1 inside the Gemini app, ending Google's separate video-model brand.[1]
  • Clips are capped at 10 seconds at launch, up from Veo's 8.[3]
  • Available now to Google AI Plus, Pro, and Ultra subscribers, and free on YouTube Shorts and YouTube Create. The developer API is promised in the "coming weeks."[4]
  • Google deliberately held back voice and speech editing of existing audio, citing safety review.[4]
  • Google's own listed limitations: complex motion, accurate text rendering, full consistency across edits.[2]

What "Gemini Omni" actually means in the model card

Gemini Omni Flash is a transformer-based model with native multimodal support for text, vision, video, and audio inputs.[2] That sentence reads like press-release filler, but it matters. The previous Google stack had Gemini handle reasoning, Veo handle pixels, and some glue layer translate between them. Omni Flash folds the generation step directly into the same Vaswani-style architecture as Gemini. The training stack is JAX on TPUs, and the training data was filtered for safety, deduplicated semantically, and annotated with captions at different levels of detail.

The model card lists five evaluation tracks: Text-to-Video-with-Audio (T2VA), Image-to-Video-with-Audio (I2VA), Reference-to-Video-with-Audio (R2VA), video editing, and image generation.[2] Google says it will publish the actual evaluation numbers when the developer API rolls out. That timing is the only firm signal we have on when third-party benchmarks become possible.

The model card also admits a limit Google rarely puts in print on launch day: "Maintaining complete consistency throughout edits, generating scenes with complex motion, or rendering perfectly accurate text remains a challenge."[2] The disclosure lines up with what builders reported in their first tests, covered later in this article.

The eight capabilities that separate Omni from Veo

Veo was a video generator with a text encoder. Gemini Omni is closer to a video generator with a brain. The functional differences cluster into eight axes.

1. Cross-modal reasoning rather than stitching. A single prompt can mix text, an image, an audio clip, and a reference video. The model reasons across all of them inside one transformer pass and emits one coherent clip with audio attached.[2]

2. Reference anything. An image, text, video, or audio reference can be passed in, and the model treats it as authoritative style or content guidance. Google's hand-opens demo uses a still image as the architectural reference and unfolds it into a 3D structure that respects the original's geometry.[1]

3. Multi-turn conversational editing with consistency. Every edit builds on the previous one. Google's violinist demo shows three sequential edits: transport the character to a new environment, make the violin invisible, change the camera angle. The character, lighting, and motion stay locked across all three.[1]

4. Gemini's world knowledge applied to generation. This is the axis no other video model on the market can match. As @xiaohu put it on launch night (translated from Chinese), "Omni is plugged into Gemini's world knowledge base ... it can make a clay-animation tutorial video on 'protein folding'."[5] Pure diffusion-based video models will hallucinate plausible protein-folding visuals. Omni renders the correct alpha-helix-to-beta-sheet transition because the underlying Gemini reasoning module knows the biology.

5. Physics-aware generation. Google lists "intuitive understanding of forces like gravity, kinetic energy, and fluid dynamics" as a capability, and the mirror-touch demos show liquid ripples, refraction, and material transformation that respect physical continuity.[1]

6. Natural-language object and character swap. A prompt like "Change spaceship to <object>" works as a one-shot edit on an existing clip, with the model maintaining lighting, motion, and occlusion relationships.[1]

7. Style and aesthetic transfer. The same source clip can be repainted into 3D voxel art, monochrome line art, felted-puppet form, or holographic line-trace using single-sentence prompts.[1]

8. Identity-verified avatars. You record yourself reading a number sequence, Google stores the avatar, and you can then place it inside generated videos. The verification step is Google's deepfake gate.[6]

What the 10-second cap actually buys you

The 10-second clip ceiling looks like a regression next to longer Sora and Seedance outputs, but Google insists it's a deployment choice. Brichtova, the product lead, told TechCrunch the cap is "not a model limitation, but rather a decision based both on a desire to get it into more hands and an anticipation that most users won't want to make much longer videos yet."[3]

Builders are already pushing the limit. @aimikoda's beat-synced portrait test uses a 16-cut prompt structure with hard cuts on every beat, fitting 16 distinct shots into the 10-second window.[7] That density only works because Omni keeps character framing consistent across the cuts. The same prompt run on a stitched video pipeline would likely lose character identity within the first few cuts.
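To put that density in numbers, here is a minimal sketch of the arithmetic behind a 16-cut, 10-second clip. It assumes the cuts are evenly spaced, which the original prompt may not do exactly, and the function name `cut_schedule` is hypothetical, used only for illustration. It is not part of any Google or Seedance API.

```python
from fractions import Fraction


def cut_schedule(clip_seconds: int, shots: int) -> list[tuple[Fraction, Fraction]]:
    """Return (start, end) times in seconds for evenly spaced shots.

    Assumption: every shot has the same length. A real beat-synced
    prompt would follow the track's actual beat grid instead.
    """
    if clip_seconds <= 0 or shots <= 0:
        raise ValueError("clip_seconds and shots must be positive")
    shot_len = Fraction(clip_seconds, shots)  # exact, avoids float drift
    return [(i * shot_len, (i + 1) * shot_len) for i in range(shots)]


schedule = cut_schedule(10, 16)
shot_len = schedule[0][1] - schedule[0][0]

print(float(shot_len))       # 0.625 seconds per shot
print(60 / float(shot_len))  # 96.0 cuts per minute, i.e. one cut per beat at 96 BPM
```

Under that even-spacing assumption, each shot lasts 0.625 seconds, so the model has well under a second to re-establish the character before the next hard cut.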

@yyyole caught the version delta within hours of the keynote. Gemini's previous in-app generator capped at 8 seconds. Omni Flash bumps it to 10. The two-second jump is small, but it confirms Google is moving the ceiling, not just rebranding.[8]

The longer-form story sits with the unreleased Omni Pro. Google has confirmed Pro exists but won't commit to a date, saying it will ship "when we feel like we're at a point where we have a step change above Flash."[3] Translation: not soon.

Where you can use Gemini Omni today

Distribution is wider than any previous Google video model launch. Gemini Omni Flash is live inside the Gemini app for AI Plus, Pro, and Ultra subscribers globally. It's wired into Google Flow (the AI filmmaking workspace) and Flow Music (lyric and music-video editing). YouTube Shorts gets it at no extra cost, and the YouTube Create app picks it up this week.[4]

Flow itself got a parallel upgrade. A new Flow Agent acts as a project-level reasoning partner that suggests plot turns, batches edits, and organizes assets. Flow also gained "Bespoke Tools," a no-code way to build custom image editors and shaders that other users can remix. Flow Music gained section-level lyric editing, cover-style transformation, and conversational music-video direction. Mobile apps shipped for Flow (Android beta) and Flow Music (iOS).[9]

The piece that's not here yet: developer and enterprise API access. Google says it is arriving in the "coming weeks," with no firm date. Until that ships, every workflow runs through Google's own surfaces: the Gemini app, Flow, or YouTube. There's no programmatic integration path on day one.

What Google deliberately did not ship

The most useful framing for the Omni launch isn't what's in the box. It's what Google chose to leave out. The model card and launch post both flag a deliberate hold: voice and speech editing of existing audio.

Google's own wording from the launch post: "Beyond the avatar feature, in terms of editing videos to change audio and speech, we are still working to test this and better understand how we can bring this capability to users responsibly."[4] The omitted capability is exactly the one that would have been most useful and most dangerous. Changing what a person says in an existing video, while keeping their face and voice, is the operational definition of a deepfake at scale.

The avatar feature that did ship has a verification gate. To make an avatar of yourself, you record a video reading a number sequence aloud. The model only stores avatars that passed this challenge.[6] That's the same anti-impersonation pattern OpenAI used for the now-discontinued Sora Cameos feature. Google has visibly studied that episode.

Every Omni-generated video carries SynthID, Google's imperceptible watermark, which can be verified inside the Gemini app, Chrome, and Google Search. Google says SynthID has now marked over 100 billion AI-generated assets, with OpenAI, ElevenLabs, and Kakao adopting the same standard.[6] SynthID isn't deepfake prevention, but it's the closest the industry has to a forensic baseline.

How the first 24 hours of builder testing went

Within hours of the keynote, two independent comparison tests landed on X, both pitting Omni against Seedance 2.0 under identical conditions. The most-cited one came from @aimikoda. "Seedance 2.0 vs Gemini Omni, tested under the same conditions. I gave the exact same prompt + storyboard reference + character reference to Gemini Omni as well. Gemini Omni surprised me with style quality and got closer than I expected in prompt adherence. But Seedance still feels ahead for storyboard execution, motion energy, camera language and environmental interaction. Gemini looks good. Seedance feels directed."[10]

The TopviewAI team ran the same matchup and pushed 23K views in under six hours: "Google has just launched its new Gemini Omni Flash model. We ran side-by-side tests against Seedance 2.0 right away."[11] That two independent testers ran the same comparison within hours is a real signal about the model market. Seedance 2.0 is the model Omni is being measured against, not Sora 2 or Runway Gen-4.

The Chinese-language AI community framed it more sharply. @EHuanglu (el.cine) reduced the pitch to ten words: "Gemini Omni is here. Its Nano Banana but for video."[12] @xiaohu pushed further into the technical claim: Omni is plugged into Gemini's world knowledge base, which is why it can produce a clay-animation tutorial on protein folding or a 26-letter rhythmic explainer that other video models can't.[5]

The overall builder read: style and prompt adherence are excellent, the world-knowledge integration is real and obvious in tutorial-style prompts, and the conversational editing changes the iteration loop. The things that don't yet hold up: complex motion, fine text rendering, and deep storyboard consistency across many shots.

Trying Omni-style workflows without Omni access

If you don't have a Google AI Plus, Pro, or Ultra subscription, or you need API access today, the closest accessible workflow on the third-party side is image-to-video and reference-to-video generation through models that have shipped APIs. Seedance 2.0 is the model builders kept naming in the side-by-side tests, and it's already exposed through several third-party providers. Disclosure: I work on seedance2.so/text-to-video, one of those third-party wrappers around Seedance 2.0. We're not affiliated with Google or ByteDance, and we don't replicate Omni's conversational editing yet. What you can do today is hand a storyboard image to seedance2.so/image-to-video and get a 10-second clip that respects the keyframes, which covers the most-requested use case in the Omni demos.

FAQ

What is Gemini Omni?

Gemini Omni is Google's multimodal video generation and editing model family. The first member, Gemini Omni Flash, takes text, image, audio, or video input and produces video with audio. It replaces Veo 3.1 inside the Gemini app.[1]

How is Gemini Omni different from Veo 3.1?

Veo was a standalone video model that took a text prompt and produced a clip. Omni is wired directly into Gemini's reasoning model and supports cross-modal input, multi-turn conversational editing, and reference-driven generation. Google has stated Omni will replace Veo inside the Gemini app.[1]

Can I use the Gemini Omni API today?

No. The developer and enterprise API is announced for the "coming weeks," with no firm date. Until then, access is through the Gemini app, Flow, or YouTube Shorts only.[4]

How long are Gemini Omni videos?

10 seconds at launch. Google describes this as a deployment choice rather than a model ceiling. The unreleased Omni Pro is expected to lift this when it ships.[3]

What did Google hold back from the Gemini Omni launch?

Voice and speech editing of existing audio. The capability exists in the model but Google is reviewing it for responsible deployment. Avatar creation is gated behind identity verification.[4]

Is Gemini Omni free?

Free on YouTube Shorts and YouTube Create. Inside the Gemini app it requires an AI Plus, Pro, or Ultra subscription. Inside Flow the same subscription gating applies, with different feature limits by tier.[4]

Will Veo go away completely?

Google has confirmed Omni replaces Veo inside the Gemini app. Veo's enterprise availability through Vertex AI hasn't been formally retired, but the consumer brand is finished. Treat Omni as the successor.[1]

Reading the Gemini Omni release

Two days in, the right way to read Gemini Omni isn't as a video model. It's as Google's first serious attempt to fold media generation into a reasoning model rather than bolting it on. The 10-second cap is artificial, the API delay is annoying, and the held-back voice editing is the right call. The capability gap to watch is the world-knowledge axis. Sora 2 and Seedance 2.0 are excellent pure video generators, but neither has a frontier LLM wired into the generation step, and that's the lane Google has staked out for Gemini Omni Flash to grow into.

References

  1. Google. Gemini Omni - Create & edit videos as easy as having a conversation. Retrieved May 2026 from gemini.google/overview/video-generation
  2. Google DeepMind. Gemini Omni Flash - Model Card. Published May 19, 2026. Retrieved May 2026 from deepmind.google/models/model-cards/gemini-omni-flash
  3. TechCrunch. Google's Gemini Omni turns images, audio, and text into video. Retrieved May 2026 from techcrunch.com/2026/05/19/googles-gemini-omni-turns-images-audio-and-text-into-video
  4. Google. Introducing Gemini Omni. Retrieved May 2026 from blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-omni
  5. Kōda (@xiaohu). Omni is plugged into Gemini's world knowledge base (translated from Chinese). X, May 19, 2026. x.com/xiaohu/status/2056880323607298286
  6. TechTimes. Google Launches Gemini Omni Video Model, but Holds Back Its Riskiest Feature. Retrieved May 2026 from techtimes.com/articles/316859
  7. Kōda (@aimikoda). Beat-synced portrait video prompt. X, May 19, 2026. x.com/aimikoda/status/2056865829782704534
  8. @yyyole. Gemini can generate 10s videos (previously 8s) (translated from Chinese). X, May 19, 2026. x.com/yyyole/status/2056883347373060197
  9. Google. New agents, mobile apps and Gemini Omni for Google Flow and Google Flow Music. Retrieved May 2026 from blog.google/innovation-and-ai/models-and-research/google-labs/flow-updates
  10. Kōda (@aimikoda). Seedance 2.0 vs Gemini Omni, tested under the same conditions. X, May 19, 2026. x.com/aimikoda/status/2056840097455014017
  11. TopviewAI (@TopviewAIhq). Side-by-side tests against Seedance 2.0. X, May 19, 2026. x.com/TopviewAIhq/status/2056795047685927337
  12. el.cine (@EHuanglu). Gemini Omni is here. Its Nano Banana but for video. X, May 19, 2026. x.com/EHuanglu/status/2056798387647987941

Further reading

  • Decrypt. Google Unveils Gemini Omni - A Next-Gen AI Video Builder That Can 'Simulate the World'. decrypt.co/368393
  • VentureBeat. Google unveils Gemini Omni 'any-to-any' AI model: what enterprises should know. venturebeat.com/ai/google-unveils-gemini-omni-any-to-any-ai-model
  • Google DeepMind (@GoogleDeepMind). We're dropping Gemini Omni. x.com/GoogleDeepMind/status/2056786446636212467

Author

Seedance Team

Categories

  • News