Gemini Omni: What Google Actually Shipped at I/O 2026

On the I/O 2026 stage, Google did something it almost never does. It killed one of its own product lines mid-flight. Buried inside the Gemini Omni announcement, in the small print under the new video model's marketing page, sits the actual headline: "Gemini Omni will replace Veo in the Gemini app."^[1] Veo 3.1 is over inside Gemini. The brand that defined Google's video-generation push for two years is being retired in favor of a single model that handles text, image, audio, and video in one architecture.

I spent the first 12 hours after the keynote reading the official model card, watching the demos, and tracking what serious builders on X actually got out of it. This article is what stuck. Eight capability axes that separate Gemini Omni Flash from Veo, the 10-second cap and what it really gates, the audio-editing feature Google chose not to ship, and the comparison every builder ran in the first 24 hours. By the end you'll know what Omni is, what it isn't, and where Google left the door open.

TL;DR

Gemini Omni Flash is a native multimodal transformer that takes text, image, audio, and video as input and produces video with audio as output.^[2]
It replaces Veo 3.1 inside the Gemini app, ending Google's separate video-model brand.^[1]
Clips are capped at 10 seconds at launch, up from Veo's 8.^[3]
Available now to Google AI Plus, Pro, and Ultra subscribers, and free on YouTube Shorts and YouTube Create. The developer API is promised for the "coming weeks."^[4]
Google deliberately held back voice and speech editing of existing audio, citing safety review.^[4]
Google's own listed limitations: complex motion, accurate text rendering, full consistency across edits.^[2]

What "Gemini Omni" actually means in the model card

Gemini Omni Flash is a transformer-based model with native multimodal support for text, vision, video, and audio inputs.^[2] That sentence reads like press-release filler, but it matters. The previous Google stack had Gemini handle reasoning, Veo handle pixels, and some glue layer translate between them. Omni Flash folds the generation step directly into the same transformer architecture as Gemini. The training stack is JAX on TPUs, and the training data was filtered for safety, deduplicated semantically, and annotated with captions at different levels of detail.

The model card lists five evaluation tracks: Text-to-Video-with-Audio (T2VA), Image-to-Video-with-Audio (I2VA), Reference-to-Video-with-Audio (R2VA), video editing, and image generation.^[2] Google says it will publish the actual evaluation numbers when the developer API rolls out. That timing is the only firm signal we have on when third-party benchmarks become possible.

The model card also admits a limit Google rarely puts in print on launch day: "Maintaining complete consistency throughout edits, generating scenes with complex motion, or rendering perfectly accurate text remains a challenge."^[2] That candor lines up with what builders reported in their first day of testing, covered later in this article.

The eight capabilities that separate Omni from Veo

Veo was a video generator with a text encoder. Gemini Omni is closer to a video generator with a brain. The functional differences cluster into eight axes.

1. Cross-modal reasoning rather than stitching. A single prompt can mix text, an image, an audio clip, and a reference video. The model reasons across all of them inside one transformer pass and emits one coherent clip with audio attached.^[2]

2. Reference anything. Image, text, video, or audio reference can be passed in and the model treats it as authoritative style or content guidance. Google's hand-opens demo uses a still image as the architectural reference and unfolds it into a 3D structure that respects the original's geometry.^[1]

3. Multi-turn conversational editing with consistency. Every edit builds on the previous one. Google's violinist demo shows three sequential edits: transport the character to a new environment, make the violin invisible, change the camera angle. The character, lighting, and motion stay locked across all three.^[1]

4. Gemini's world knowledge applied to generation. This is the axis no other video model on the market can match. As @xiaohu put it on launch night (translated from Chinese), "Omni is plugged into Gemini's world knowledge base ... it can make a clay-animation tutorial video on 'protein folding.'"^[5] Pure diffusion-based video models will hallucinate plausible protein-folding visuals. Omni renders the correct alpha-helix-to-beta-sheet transition because the underlying Gemini reasoning module knows the biology.

5. Physics-aware generation. Google lists "intuitive understanding of forces like gravity, kinetic energy, and fluid dynamics" as a capability, and the mirror-touch demos show liquid ripples, refraction, and material transformation that respect physical continuity.^[1]

6. Natural-language object and character swap. A prompt like "Change spaceship to <object>" works as a one-shot edit on an existing clip, with the model maintaining lighting, motion, and occlusion relationships.^[1]

7. Style and aesthetic transfer. The same source clip can be repainted into 3D voxel art, monochrome line art, felted-puppet form, or holographic line-trace using single-sentence prompts.^[1]

8. Identity-verified avatars. You record yourself reading a number sequence, Google stores the avatar, and you can then place it inside generated videos. The verification step is Google's deepfake gate.^[6]
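
Axes 1 to 3 describe a request that mixes modalities and then accumulates edits turn by turn. Since no developer API exists yet, the following is only a mental model written as Python data classes. Every name in it (`Reference`, `EditSession`, `edit`, `transcript`, the file name `violinist.png`, and the opening prompt) is hypothetical and not taken from any Google interface. The three edit instructions are the ones from the violinist demo.

```python
from dataclasses import dataclass, field

# The four input modalities named in the model card.
MODALITIES = {"text", "image", "audio", "video"}


@dataclass(frozen=True)
class Reference:
    """One piece of reference material attached to a prompt (hypothetical shape)."""
    kind: str  # one of MODALITIES
    uri: str   # where the reference lives; purely illustrative

    def __post_init__(self) -> None:
        if self.kind not in MODALITIES:
            raise ValueError(f"unsupported modality: {self.kind}")


@dataclass
class EditSession:
    """An opening prompt plus an ordered list of follow-up edits."""
    prompt: str
    references: list[Reference] = field(default_factory=list)
    edits: list[str] = field(default_factory=list)

    def edit(self, instruction: str) -> "EditSession":
        # Each edit is appended, never replaced: later turns build on earlier ones.
        self.edits.append(instruction)
        return self

    def transcript(self) -> list[str]:
        # The full conversation the model would need to see on the next turn.
        return [self.prompt, *self.edits]


session = EditSession(
    prompt="A violinist performs on a concert stage",  # illustrative prompt
    references=[Reference("image", "violinist.png")],  # hypothetical file
)
session.edit("Transport the character to a new environment")
session.edit("Make the violin invisible")
session.edit("Change the camera angle")

for turn, text in enumerate(session.transcript()):
    print(turn, text)
```

The point of the sketch is the append-only history: consistency across edits only makes sense if turn three is interpreted against everything that came before it, not against the original prompt alone.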

What the 10-second cap actually buys you

The 10-second clip ceiling looks like a regression next to longer Sora and Seedance outputs, but Google insists it's a deployment choice. Brichtova, the product lead, told TechCrunch the cap is "not a model limitation, but rather a decision based both on a desire to get it into more hands and an anticipation that most users won't want to make much longer videos yet."^[3]

Builders are already pushing the limit. @aimikoda's beat-synced portrait test uses a 16-cut prompt structure with hard cuts on every beat, fitting 16 distinct shots into the 10-second window^[7]. That density only works because Omni handles consistent character framing across the cuts. The same prompt run on a stitched video pipeline would lose identity at cut 3.
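
To get a feel for how dense that is, here is a small arithmetic sketch. It assumes the 16 shots are equally long and that each shot lasts exactly one beat, which the source post does not state outright. `cut_schedule` is a helper name made up for this example.

```python
from fractions import Fraction


def cut_schedule(clip_seconds: int, shots: int) -> list[tuple[Fraction, Fraction]]:
    """Return (start, end) times for equally long shots that fill the clip."""
    if clip_seconds <= 0 or shots <= 0:
        raise ValueError("clip_seconds and shots must be positive")
    shot_len = Fraction(clip_seconds, shots)
    return [(i * shot_len, (i + 1) * shot_len) for i in range(shots)]


schedule = cut_schedule(clip_seconds=10, shots=16)
shot_len = schedule[0][1] - schedule[0][0]

print(float(shot_len))        # 0.625 seconds per shot
print(float(60 / shot_len))   # 96.0 cuts per minute, i.e. 96 BPM at one cut per beat
```

Under those assumptions each shot gets well under a second of screen time, so the character has to read as the same person from the very first frames of every cut.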

@yyyole caught the version delta within hours of the keynote. Gemini's previous in-app generator capped at 8 seconds. Omni Flash bumps it to 10. The two-second jump is small, but it confirms Google is moving the ceiling, not just rebranding.^[8]

The longer-form story sits with the unreleased Omni Pro. Google has confirmed Pro exists but won't commit to a date, saying it will ship "when we feel like we're at a point where we have a step change above Flash."^[3] Translation: not soon.

Where you can use Gemini Omni today

Distribution is wider than any previous Google video model launch. Gemini Omni Flash is live inside the Gemini app for AI Plus, Pro, and Ultra subscribers globally. It's wired into Google Flow (the AI filmmaking workspace) and Flow Music (lyric and music-video editing). YouTube Shorts gets it at no extra cost, and the YouTube Create app picks it up this week.^[4]

Flow itself got a parallel upgrade. A new Flow Agent acts as a project-level reasoning partner that suggests plot turns, batches edits, and organizes assets. Flow also gained "Bespoke Tools," a no-code way to build custom image editors and shaders that other users can remix. Flow Music gained section-level lyric editing, cover-style transformation, and conversational music-video direction. Mobile apps shipped for Flow (Android beta) and Flow Music (iOS).^[9]

The piece that's not here yet: developer and enterprise API access. Google says only that it is due in the "coming weeks," with no firm date. Until that ships, every workflow runs through the Gemini app or Flow UI. There's no programmatic integration path on day one.
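
The launch-day availability described in this article can be condensed into a lookup table. This is a sketch of the rules as reported here, not an official entitlement matrix. The surface keys, the `Tier` enum, and `can_use` are names invented for the example. Flow's gating follows the FAQ below, and Flow Music is left out because its tier requirements are not stated.

```python
from enum import Enum


class Tier(Enum):
    FREE = "free"
    AI_PLUS = "ai_plus"
    AI_PRO = "ai_pro"
    AI_ULTRA = "ai_ultra"


PAID = frozenset({Tier.AI_PLUS, Tier.AI_PRO, Tier.AI_ULTRA})
EVERYONE = PAID | {Tier.FREE}

# Surface -> tiers that can reach Omni Flash there at launch, per this article.
ACCESS: dict[str, frozenset[Tier]] = {
    "gemini_app": PAID,
    "flow": PAID,                  # feature limits vary by tier
    "youtube_shorts": EVERYONE,    # no extra cost
    "youtube_create": EVERYONE,    # rolling out during launch week
    "developer_api": frozenset(),  # announced, not yet available
}


def can_use(surface: str, tier: Tier) -> bool:
    """True if the given tier can reach Omni Flash on the given surface."""
    return tier in ACCESS.get(surface, frozenset())


print(can_use("gemini_app", Tier.FREE))         # False
print(can_use("youtube_shorts", Tier.FREE))     # True
print(can_use("developer_api", Tier.AI_ULTRA))  # False
```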

What Google deliberately did not ship

The most useful framing for the Omni launch isn't what's in the box. It's what Google chose to leave out. The model card and launch post both flag a deliberate hold: voice and speech editing of existing audio.

Google's own wording from the launch post: "Beyond the avatar feature, in terms of editing videos to change audio and speech, we are still working to test this and better understand how we can bring this capability to users responsibly."^[4] The omitted capability is exactly the one that would have been most useful and most dangerous. Changing what a person says in an existing video, while keeping their face and voice, is the operational definition of deepfake at scale.

The avatar feature that did ship has a verification gate. To make an avatar of yourself, you record a video reading a number sequence aloud. The model only stores avatars that passed this challenge.^[6] That's the same anti-impersonation pattern OpenAI used for the now-discontinued Sora Cameos feature. Google studied that episode visibly.
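
The number-reading step is a form of challenge-response check, and its general shape is easy to sketch. Google has not published how its gate works, so everything below is a generic illustration under assumptions: the function names are invented, the transcript is assumed to come from a separate speech-to-text step, and matching the speaker's face and voice is out of scope.

```python
import secrets

NUMBER_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
}


def new_challenge(length: int = 6) -> list[int]:
    """Draw a fresh, unpredictable digit sequence for the user to read aloud."""
    return [secrets.randbelow(10) for _ in range(length)]


def digits_spoken(transcript: str) -> list[int]:
    """Pull digits out of a transcript such as '4 7 1' or 'four, seven, one'."""
    digits: list[int] = []
    for token in transcript.lower().replace(",", " ").split():
        if token.isascii() and token.isdigit():
            digits.extend(int(ch) for ch in token)
        elif token in NUMBER_WORDS:
            digits.append(NUMBER_WORDS[token])
    return digits


def passes_challenge(challenge: list[int], transcript: str) -> bool:
    """Accept only if the spoken digits match the challenge exactly, in order."""
    return digits_spoken(transcript) == challenge


print(passes_challenge([4, 7, 1], "four 7 one"))      # True
print(passes_challenge([4, 7, 1], "four seven two"))  # False
```

The usual rationale for drawing the sequence at random per attempt is that footage recorded earlier cannot contain the right numbers, which is what makes the recording evidence that the person was present.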

Every Omni-generated video carries SynthID, Google's imperceptible watermark, which can be verified inside the Gemini app, Chrome, and Google Search. Google says SynthID has now marked over 100 billion AI-generated assets, with OpenAI, ElevenLabs, and Kakao adopting the same standard.^[6] SynthID isn't deepfake prevention, but it's the closest the industry has to a forensic baseline.

How the first 24 hours of builder testing went

Within hours of the keynote, two independent comparison tests landed on X, both pitting Omni against Seedance 2.0 under identical conditions. The most-cited one came from @aimikoda. "Seedance 2.0 vs Gemini Omni, tested under the same conditions. I gave the exact same prompt + storyboard reference + character reference to Gemini Omni as well. Gemini Omni surprised me with style quality and got closer than I expected in prompt adherence. But Seedance still feels ahead for storyboard execution, motion energy, camera language and environmental interaction. Gemini looks good. Seedance feels directed."^[10]

The TopviewAI team ran the same matchup and pushed 23K views in under six hours: "Google has just launched its new Gemini Omni Flash model. We ran side-by-side tests against Seedance 2.0 right away."^[11] That two independent testers ran the same comparison within hours is a real signal about the model market. Seedance 2.0 is the model Omni is being measured against, not Sora 2 or Runway Gen-4.

The Chinese-language AI community framed it more sharply. @EHuanglu (el.cine) reduced the pitch to two short sentences: "Gemini Omni is here. Its Nano Banana but for video."^[12] @xiaohu pushed further into the technical claim: Omni is plugged into Gemini's world knowledge base, which is why it can produce a clay-animation tutorial on protein folding or a 26-letter rhythmic explainer that other video models can't.^[5]

The overall builder read: style and prompt adherence are excellent, the world-knowledge integration is real and obvious in tutorial-style prompts, and the conversational editing changes the iteration loop. The things that don't yet hold up: complex motion, fine text rendering, and deep storyboard consistency across many shots.

Trying Omni-style workflows without Omni access

If you don't have an AI Plus, Pro, or Ultra subscription, or you need API access today, the closest accessible workflow on the third-party side is image-to-video and reference-to-video generation through models that have shipped APIs. Seedance 2.0 is the model builders kept naming in the side-by-side tests, and it's already exposed through several third-party providers. Disclosure: I work on seedance2.so/text-to-video, one of those third-party wrappers around Seedance 2.0. We're not affiliated with Google or ByteDance, and we don't replicate Omni's conversational editing yet. What you can do today is hand a storyboard image to seedance2.so/image-to-video and get a 10-second clip that respects the keyframes, which covers the most-requested use case in the Omni demos.

FAQ

What is Gemini Omni?

Gemini Omni is Google's multimodal video generation and editing model family. The first member, Gemini Omni Flash, takes text, image, audio, or video input and produces video with audio. It replaces Veo 3.1 inside the Gemini app.^[1]

How is Gemini Omni different from Veo 3.1?

Veo was a standalone video model that took a text prompt and produced a clip. Omni is wired directly into Gemini's reasoning model and supports cross-modal input, multi-turn conversational editing, and reference-driven generation. Google has stated Omni will replace Veo inside the Gemini app.^[1]

Can I use the Gemini Omni API today?

No. The developer and enterprise API has been announced for the "coming weeks," with no firm date. Until then, access is through the Gemini app, Flow, or YouTube Shorts only.^[4]

How long are Gemini Omni videos?

10 seconds at launch. Google describes this as a deployment choice rather than a model ceiling. The unreleased Omni Pro is expected to lift this when it ships.^[3]

What did Google hold back from the Gemini Omni launch?

Voice and speech editing of existing audio. The capability exists in the model but Google is reviewing it for responsible deployment. Avatar creation is gated behind identity verification.^[4]

Is Gemini Omni free?

Free on YouTube Shorts and YouTube Create. Inside the Gemini app it requires an AI Plus, Pro, or Ultra subscription. Inside Flow the same subscription gating applies, with different feature limits by tier.^[4]

Will Veo go away completely?

Google has confirmed Omni replaces Veo inside the Gemini app. Veo's enterprise availability through Vertex AI hasn't been formally retired, but the consumer brand is finished. Treat Omni as the successor.^[1]

Reading the Gemini Omni release

Two days in, the right way to read Gemini Omni isn't as a video model. It's as Google's first serious attempt to fold media generation into a reasoning model rather than bolting it on. The 10-second cap is artificial, the API delay is annoying, and the held-back voice editing is the right call. The capability gap to watch is the world-knowledge axis. Sora 2 and Seedance 2.0 are excellent pure video generators, but neither has a frontier LLM wired into the generation step, and that's the lane Google has staked out for Gemini Omni Flash to grow into.

References

1. Google. Gemini Omni — Create & edit videos as easy as having a conversation. Retrieved May 2026 from gemini.google/overview/video-generation
2. Google DeepMind. Gemini Omni Flash — Model Card. Published May 19, 2026. Retrieved May 2026 from deepmind.google/models/model-cards/gemini-omni-flash
3. TechCrunch. Google's Gemini Omni turns images, audio, and text into video. Retrieved May 2026 from techcrunch.com/2026/05/19/googles-gemini-omni-turns-images-audio-and-text-into-video
4. Google. Introducing Gemini Omni. Retrieved May 2026 from blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-omni
5. Kōda (@xiaohu). Omni 接通了 Gemini 的世界知识库 [Omni is plugged into Gemini's world knowledge base]. X, May 19, 2026. x.com/xiaohu/status/2056880323607298286
6. TechTimes. Google Launches Gemini Omni Video Model, but Holds Back Its Riskiest Feature. Retrieved May 2026 from techtimes.com/articles/316859
7. Kōda (@aimikoda). Beat-synced portrait video prompt. X, May 19, 2026. x.com/aimikoda/status/2056865829782704534
8. 沐阳 (@yyyole). Gemini 可以生成 10s 的视频（之前是 8s） [Gemini can generate 10s videos (previously 8s)]. X, May 19, 2026. x.com/yyyole/status/2056883347373060197
9. Google. New agents, mobile apps and Gemini Omni for Google Flow and Google Flow Music. Retrieved May 2026 from blog.google/innovation-and-ai/models-and-research/google-labs/flow-updates
10. Kōda (@aimikoda). Seedance 2.0 vs Gemini Omni, tested under the same conditions. X, May 19, 2026. x.com/aimikoda/status/2056840097455014017
11. TopviewAI (@TopviewAIhq). Side-by-side tests against Seedance 2.0. X, May 19, 2026. x.com/TopviewAIhq/status/2056795047685927337
12. el.cine (@EHuanglu). Gemini Omni is here. Its Nano Banana but for video. X, May 19, 2026. x.com/EHuanglu/status/2056798387647987941
