Hey friends, today's video is a no-bullshit guide on the current state of AI video generation, because if you believe the headlines, the entire Hollywood movie industry is about to be replaced by AI in the next few minutes. In reality, though, we're not even close. So let's look beyond the flashy demos that do a phenomenal job of maximizing shareholder value but little else, and go over what's actually possible right now. Let's get started.

Instead of boring you with technical jargon, let me just show you what's going on, starting with a simple analogy using ChatGPT. Let's say I ask it to write the opening scene of a TV show. I'll let it run, and in a couple of seconds it'll spit out a script with the setting, the characters, and all the good stuff. Simple enough, right? Now, what happens when I ask ChatGPT, in the same chat, to write the next scene? Let's run it. As you'll see in the result, it quote-unquote "remembers" what happened in the opening and continues the same narrative. The characters are consistent, the setting is consistent, the story is consistent, and in a nutshell, that consistency is the single biggest roadblock when it comes to generating AI videos. So keep that keyword, consistency, in mind for this next part.

Moving over to Google's Flow app, one of the best AI video generation tools right now. Here, I've recreated a scene starring Darth Vader, so let's play this back with audio. It's only 8 seconds.

"I am your mother."

Okay, first of all, notice how detailed and realistic this is. Darth Vader is walking towards us with all the right sound effects.
Sparks are flying out behind him because he just cooked someone. Not sure if that's how you'd put it. And his voice...

"I am your mother."

His voice is pretty damn good. And guess what? As long as you're willing to pay a bit of money for the Flow app and use this prompt, which I'll share down below, anyone can create this clip in five minutes. So the point I'm making here is that AI video models are insanely powerful. But if that's the case, what exactly is holding us back from producing Hollywood-grade movies and high-production YouTube videos?

Here's the problem. Watch what happens when I try to extend this scene, like I did with ChatGPT, using this prompt: "Next, Darth Vader raises his other arm with a red lightsaber and says, 'Get ready for a spanking.'" Which is something, you know, Mr. Vader would say. We'll generate and fast-forward this next part.

"Get ready for a spanking."

Okay, yeah, that didn't work at all. The lightsaber was already in the scene, and it's in the wrong hand. Darth Vader doesn't even look the same between scenes, the voices are different, and the background completely changed. This is a perfect example of character inconsistency.

Editor Jeff here. Quick heads up: OpenAI announced Sora 2 right after I made this video, and they've launched a few features targeting the consistency problem. I'll explain what those features actually do at the end, but the bottom line is they do not replace the need for the workflow I'm about to show you. And with that, let's dive back into the video.
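The ChatGPT-versus-video-model difference comes down to context. A chat client resends the entire conversation history with every request, so the model can stay consistent with earlier scenes; a typical text-to-video call is stateless and sees only the one prompt you just typed. Here's a minimal toy sketch of that difference; the function names and stub replies are mine, not any real API.

```python
# Toy sketch: why a chatbot "remembers" while a video model doesn't.
# The reply logic below is a stand-in, not a real model call.

def chat_turn(history, user_message):
    """A chat client resends the ENTIRE history with every request,
    so the model can condition on every earlier scene."""
    history = history + [{"role": "user", "content": user_message}]
    # A real model would read every message in `history`;
    # this stub just reports how much context it received.
    reply = f"[scene conditioned on {len(history)} context messages]"
    return history + [{"role": "assistant", "content": reply}], reply

def video_generation(prompt):
    """A typical text-to-video call is stateless: it sees only this
    one prompt, with no memory of previously generated clips."""
    return "[clip conditioned on 1 prompt, 0 earlier clips]"

history = []
history, first = chat_turn(history, "Write the opening scene of a TV show.")
history, second = chat_turn(history, "Now write the next scene.")
print(second)
print(video_generation("Darth Vader raises a red lightsaber."))
```

That zero in the video model's context is exactly the inconsistency we just saw with Darth Vader, and it's what the rest of the workflow works around.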
Put simply, the video models do not remember any details about the scenes they just generated. Even if I used the exact same prompt to describe Darth Vader again, the model would still generate a slightly different character, breaking the consistency across scenes. But don't worry, there is a way to overcome this. Let's take a look at these two skits I created from scratch. Here's the first one:

"Google Gemini, can you find that email from yesterday?"
"No, but I can show you an ad that looks like an email. You're welcome."

All right, and the second one:

"Hey Gemini, can you play that YouTube video from Jeff Su?"
"Absolutely. But first, please enjoy this unskippable 30-minute ad."

Now, was that perfect? No. Given more time, I could have made them much more polished. But the key is that the appearance and voice of the Gemini mascot stayed the same, or consistent, across scenes. And to achieve that level of consistency, we just need to follow four simple steps.

Step one: generate an image of our character. That's right, even though this video is all about AI video, our very first step is to use an image generation tool to create a static image of our character. Normally I use Midjourney for this, but since Midjourney is a paid app, we're going to use Google's free image generation tool, Whisk, for this tutorial. And at this point, I want to be very clear: the tools I mention in this video matter a lot less than the workflow and the underlying logic. Okay, back in Whisk, I'm just going to paste the prompt that will generate the Gemini mascot character. Don't worry, I'll share all the prompts I use in this video down below.
And under Settings, this is very important: I'm going to disable Precise Reference for now, because I want the AI to have more creative freedom. I'll send this, and let's fast-forward this next part. Okay, so the first two results were already great. I just ran it another time to show you that if you aren't happy with the first batch, you can generate a few more batches until you find something you like. For me, I'm going to go with this one right here. I like that it's bigger and it's a full frontal shot of the character, which might make the future steps easier.

Pro tip: if you have an image you mostly like, but you want to change one specific thing, you can do this. Simply click the Refine button, and under Settings, make sure Precise Reference is enabled. Then just describe the change you want to make, for example, "Change the color of the fur to white with pastel orange gradients," and click Generate. We'll fast-forward this next part. All right, this looks really good, right? The reason this works so well is that by enabling Precise Reference, we're telling Whisk to use Google's Nano Banana image generation model, which is fantastic at maintaining character consistency in still images. If you don't believe me, you can upload the original image into the Google Gemini web app with image editing enabled, or even Google's AI Studio, use the exact same prompt, and you'll see that only the fur color changes. The character stays the same. Yes, all three of these methods are free, and no, Google is not sponsoring this video.
Although I really wish they would. Maybe I'm just not PC enough. Anyways, once you're happy with the image, just click here to download it, and we're now ready for step two. By the way, I have a free AI toolkit that cuts through the noise and helps you master essential AI tools and workflows; I'll leave a link to that down below.

Step two: create the starting frame. All right, now that we have our main character, it's time to place him into a scene that we'll eventually turn into a video clip. Staying right here in Whisk, we're going to expand the sidebar, and we can either upload or simply drag the image from step one into this Character box. By doing this, we're basically telling Whisk: hey, see this character right here? I want you to include this exact character in the next scene we generate. Because that's the entire point, right? We want the mascot to have the same appearance in every single scene. After making sure the subject is selected here, we go back into Settings and make sure Precise Reference is enabled. Then we use this prompt, which again I'll share, to generate the still image of our starting scene, which depicts the mascot talking to a female worker in an office setting. Just like before, the first batch was fine (these two), but I ran it one more time, and I'm glad I did, because I actually like this one a bit more. The entire mascot is in frame, versus the first option, and in that other one the mouse is all messed up. So I'm just going to go ahead and download this image.
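The reference image handles the visual identity, but it also helps to keep the text description of your character word-for-word identical across scene prompts, so no wording drift sneaks in between generations. A toy sketch of that habit; the names and the character description are mine for illustration, not part of Whisk:

```python
# Sketch: keep one canonical character description and splice it,
# verbatim, into every scene prompt. Names here are illustrative.

CHARACTER_SHEET = (
    "a plush Gemini mascot with white fur and pastel orange gradients, "
    "large friendly eyes"
)

def build_scene_prompt(character_sheet, scene):
    # The character span is identical in every prompt; only the
    # scene description changes.
    return f"{character_sheet}, {scene}"

scene_1 = build_scene_prompt(
    CHARACTER_SHEET, "talking to a female worker in an office"
)
scene_2 = build_scene_prompt(
    CHARACTER_SHEET, "talking to a male co-worker at a standing desk"
)
print(scene_1)
print(scene_2)
```

Same idea as the reference image, applied to text: one source of truth for the character, reused everywhere.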
And this is going to be the first frame of our first video clip. Now, just to prove to you how critical some of these settings are, I'm going to deselect the mascot as the subject, turn off Precise Reference, and use the exact same prompt. I'll speed this up here. All right, these look pretty terrible, because as you can see, without a reference image, Whisk basically generated a mascot from scratch, and these two don't even look the same within the same batch.

All right, since we're creating two separate clips for our skit, we just need to repeat the exact same process to create the starting frame for our second video. I'll keep the mascot selected as the subject, make sure the Precise Reference feature is toggled on, and simply use a different prompt. This time we're going to have the Gemini mascot interact with a male co-worker. All right, perfect. I think I like this one the most, so I'm just going to download it. Now that we have our two starting frames, with our mascot looking perfectly consistent in both, we are finally ready to generate some video footage.

Step three: actually creating the videos. To do this, we're going to head on over to Google's Flow app. Quick heads up: I am using the paid version of Flow, so I have access to the Veo 3 Quality model, but I actually tested this with Veo 3 Fast, the model free users have access to, and it works the exact same way. First, I'll select the Frames to Video option and upload our first starting frame, the one with the female worker in it, then click Crop to save. This tells the AI we're giving it a still image that we want to turn into an animated video.
Once that's uploaded, I'm going to paste in this prompt, which tells Flow exactly how I want the scene to play out, from the dialogue to the action. Don't worry, we'll talk more about how to write effective text-to-video prompts in a little bit. Under Settings, I want this to be in landscape, that's fine, and I actually want four outputs per prompt. Yes, this eats up my credits a lot faster, but it gives me a much higher chance that at least one of the outputs will be usable. You'll see what I mean. And we'll hit Generate.

Okay, the videos are done. Let's go over an example of a bad output:

"Google Gemini, can you find that email from yesterday?"
"No, but I can show you an ad that looks like an email. You're welcome."

Okay, so obviously that one doesn't work, but because we had four outputs, at least one of them should be good. Luckily for me, all three of the others were fine, and I like this one the most. Let's play it back:

"Google Gemini, can you find that email from yesterday?"
"No, but I can show you an ad that looks like an email. You're welcome."

So I'm going to favorite this and then download it for the next step. Usually I'd go with Upscaled if I were uploading to, say, YouTube, but since we're just going over an example here, I'll choose the original size. Now we just rinse and repeat for our second scene: keep Frames to Video selected, upload and select the starting frame for our second clip, paste in this prompt, and hit Generate.
And we'll fast-forward to the next part. All right, the videos are done, and looking through this batch, I like this one the most:

"Hey Gemini, can you play that YouTube video from Jeff Su?"
"Absolutely. But first, please enjoy this unskippable 30-minute ad."

All right, I'm going to favorite it and download it in original size. Let's actually play both clips back to back:

"Google Gemini, can you find that email from yesterday?"
"No, but I can show you an ad that looks like an email. You're welcome."

"Hey Gemini, can you play that YouTube video from Jeff Su?"
"Absolutely. But first, please enjoy this unskippable 30-minute ad."

Again, definitely not perfect, but the important thing is that the Gemini mascot looks the same across both clips. But there's another issue: the voice of the Gemini mascot is completely different in the two scenes. Don't worry, we're going to fix that in the next step. But before we do, let me share my process for writing text-to-video prompts. In a nutshell, I created a Gemini Gem that takes user input and outputs an optimized video prompt. I've also uploaded video prompting best practices as knowledge files. After starting a new chat in the Gem, I first upload the starting frame image and a screenshot of the Flow app to give it additional context. Then I just describe the scene I want and give it a script, which I obviously had to come up with, and the Gemini Gem will write a detailed prompt optimized for Google's Veo model.
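To make that concrete, here's a rough sketch of the structure my Gem-written prompts end up with: separate sections for subject, action, dialogue, camera, and audio. The section layout is my own guess at good practice, not an official Veo prompt schema, and the helper name is mine:

```python
# Sketch of assembling a structured text-to-video prompt.
# The five-section layout is an assumption about good practice,
# not an official schema.

def build_video_prompt(subject, action, dialogue, camera, audio):
    lines = [
        f"Subject: {subject}",
        f"Action: {action}",
        f'Dialogue: "{dialogue}"',
        f"Camera: {camera}",
        f"Audio: {audio}",
    ]
    return "\n".join(lines)

prompt = build_video_prompt(
    subject="the plush Gemini mascot from the reference frame",
    action="turns to the female worker and shrugs apologetically",
    dialogue="No, but I can show you an ad that looks like an email. "
             "You're welcome.",
    camera="static medium shot, office interior, soft daylight",
    audio="quiet office ambience, faint keyboard clicks",
)
print(prompt)
```

Whether you use a Gem or write these by hand, the point is the same: spelling out subject, action, dialogue, camera, and audio separately gives the video model far less room to improvise.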
I'll link this below for you to try for free, but let me know in the comments if you want a full video on how to create powerful Gemini Gems and custom GPTs, because it's actually a lot of work to create something really good.

Step four. All right, our final step is to give our mascot a consistent voice across both scenes. For this, we're going to use a tool called ElevenLabs. Once you're signed in, navigate to the Voice Changer option on the left, and upload the video file for scene one that we downloaded from Flow. Then choose the voice you want to change your character's voice to; I decided on Malvorex, the monster voice, which sounds about right. Then I'm just going to click Generate Speech and wait a little bit. Okay, let's play this back:

"Google Gemini, can you find that email from yesterday?"
"No, but I can show you an ad that looks like an email. You're welcome."

Okay, you'll notice that the mascot's voice and the female professional's voice have both been changed, but that's okay; it's part of the plan. We can now download this new audio. Next, we do the exact same thing for scene two: upload the video, and, most importantly, select the exact same voice, the monster, of course, because the whole point is to keep the mascot's voice consistent. And we click Generate Speech. Okay, I played it back, and it sounds good. We'll download this to use in the next and final step.
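If you end up doing this for many scenes, the same voice-change step can in principle be scripted against ElevenLabs' API instead of the web UI. Fair warning: the endpoint path, header name, and model ID below are my assumptions based on ElevenLabs' public docs at the time of writing; verify them against the current docs before use. This sketch only builds the request, it doesn't send anything:

```python
# Sketch: building (not sending) an ElevenLabs speech-to-speech request.
# Endpoint path, header name, and model_id are ASSUMPTIONS; check the
# current ElevenLabs API docs before relying on them.

def build_voice_change_request(voice_id, audio_path, api_key):
    url = f"https://api.elevenlabs.io/v1/speech-to-speech/{voice_id}"
    headers = {"xi-api-key": api_key}
    files = {"audio": audio_path}                  # clip audio from Flow
    data = {"model_id": "eleven_english_sts_v2"}   # assumed model name
    return url, headers, files, data

# The consistency trick in one line: the SAME voice_id for every scene.
MONSTER_VOICE_ID = "voice-id-goes-here"  # placeholder; copy from your account
req_scene_1 = build_voice_change_request(MONSTER_VOICE_ID, "scene1.wav", "key")
req_scene_2 = build_voice_change_request(MONSTER_VOICE_ID, "scene2.wav", "key")
```

Identical endpoint and voice ID for both scenes is the programmatic equivalent of picking the same voice in the dropdown twice.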
Now, for the final bit of magic, we need to bring the original video clips (the ones with inconsistent audio from Flow) and the new audio files from ElevenLabs into a video editing tool like Final Cut Pro. First, we detach the original, inconsistent audio from both clips. Then we bring in the two new audio files we just generated with ElevenLabs. And here's a key step: I'm going to manually replace only the mascot's lines with the new, consistent monster voice. This way, the human actors keep their original voices, but our mascot now sounds exactly the same across both scenes. As a final touch to really sell the scene, we can layer in some subtle ambient office sound effects in the background.

"Google Gemini, can you find that email from yesterday?"
"No, but I can show you an ad that looks like an email. You're welcome."

"Hey Gemini, can you play that YouTube video from Jeff Su?"
"Absolutely. But first, please enjoy this unskippable 30-minute ad."

And with that, we've successfully created a multi-scene skit with an AI character that is both visually and audibly consistent. A few things I want to leave you with. First, it's totally possible to have two or more consistent characters across scenes: simply upload two or more subjects into Whisk, describe the scene, and use that as your starting frame. The principle is the same. Second, let's talk about third-party tools. There are capable AI video tools out there, like OpenArt, Hailuo, and Kling, that market themselves as all-in-one solutions.
These tools do make the video generation process easier, but to produce polished videos, there's still a ton of manual work involved, like generating the initial character and fixing the audio. Not to mention, those tools aren't exactly easy to use for the average person. So here's the bottom line: again, video models have gotten extremely powerful, but AI video tools are just that, tools. We need to learn what each tool is good for and build a workflow that combines the strengths of each one. Just think about what we did today. First, we used Whisk to generate our character. Then we used Whisk again to create the starting frame. Then we used a custom Gemini Gem to write a text-to-video prompt. We used Flow to actually generate the video. Then we used ElevenLabs to generate consistent audio. And after all of that, we still had to use a video editor to piece it all together.

All right, so about Sora 2. They announced two features. The first one is called Cameo, which uses a recording of your actual face and voice to keep your likeness consistent across scenes. The issue here is that Cameo only works with real people and pets, so it's very limited in the characters we can actually create. The second feature is called Recut, which lets you load the last few seconds of a clip into your next prompt to maintain continuity. If this works as intended, it's a big deal, but it's just one step in the workflow. We still need to generate the character, write robust video prompts, fix the audio, and so on. So yeah, these seem like awesome features, but they're just that: features that need to be integrated within a broader workflow.
Let me know if you want a full tutorial on Sora 2. See you in the next video, and in the meantime, have a great one.