Hey friends, today's video is a no-bullshit guide on the current state of AI video generation, because if you believe the headlines, the entire Hollywood movie industry is about to be replaced by AI in the next few minutes. In reality, though, we're not even close. So let's look beyond the flashy demos that do a phenomenal job of maximizing shareholder value but little else, and go over what's actually possible right now. Let's get started.

Instead of boring you with technical jargon, let me just show you what's going on, starting with a simple analogy using ChatGPT. Let's say I ask it to write the opening scene of a TV show. I'll let it run, and in a couple of seconds it'll spit out a script with the setting, the characters, and all the good stuff. Simple enough, right? Now, what happens when I ask ChatGPT, in the same chat, to write the next scene? Let's run it. As you'll see in the result, it quote-unquote "remembers" what happened in the opening and continues the same narrative. The characters are consistent, the setting is consistent, the story is consistent, and in a nutshell, that consistency is the single biggest roadblock when it comes to generating AI videos. So keep that keyword, consistency, in mind for this next part.

Moving over to Google's Flow app, one of the best AI video generation tools right now. Here, I've recreated a scene starring Darth Vader, so let's play this back with audio. It's only 8 seconds.

"I am your mother."

Okay, first of all, notice how detailed and realistic this is. Darth Vader is walking towards us with all the right sound effects.
Sparks are flying out behind him because he just cooked someone. Not sure if that's how you'd put it. And his voice...

"I am your mother."

His voice is pretty damn good. And guess what? As long as you're willing to pay a bit of money for the Flow app and use this prompt, which I'll share down below, anyone can create this clip in five minutes. So the point I'm making here is that AI video models are insanely powerful. But if that's the case, what exactly is holding us back from producing Hollywood-grade movies and high-production YouTube videos?

Here's the problem. Watch what happens when I try to extend this scene, like I did with ChatGPT, using this prompt: "Next, Darth Vader raises his other arm with a red lightsaber and says, 'Get ready for a spanking.'" Which is something, you know, Mr. Vader would say. We'll generate and fast-forward this next part.

"Get ready for a spanking."

Okay, yeah, that didn't work at all. The lightsaber was already in the scene, and it's in the wrong hand. Darth Vader doesn't even look the same between scenes, the voices are different, and the background completely changed. This is a perfect example of character inconsistency.

Editor Jeff here. Quick heads up: OpenAI announced Sora 2 right after I made this video, and they've launched a few features targeting the consistency problem. I'll explain what those features actually do at the end, but the bottom line is they do not replace the need for the workflow I'm about to show you. And with that, let's dive back into the video.
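The ChatGPT-versus-video-model difference comes down to context. A chat client resends the entire conversation history with every request, so the model can stay consistent with earlier scenes; a typical text-to-video call is stateless and sees only the one prompt you just typed. Here's a minimal toy sketch of that difference; the function names and stub replies are mine, not any real API.

```python
# Toy sketch: why a chatbot "remembers" while a video model doesn't.
# The reply logic below is a stand-in, not a real model call.

def chat_turn(history, user_message):
    """A chat client resends the ENTIRE history with every request,
    so the model can condition on every earlier scene."""
    history = history + [{"role": "user", "content": user_message}]
    # A real model would read every message in `history`;
    # this stub just reports how much context it received.
    reply = f"[scene conditioned on {len(history)} context messages]"
    return history + [{"role": "assistant", "content": reply}], reply

def video_generation(prompt):
    """A typical text-to-video call is stateless: it sees only this
    one prompt, with no memory of previously generated clips."""
    return "[clip conditioned on 1 prompt, 0 earlier clips]"

history = []
history, first = chat_turn(history, "Write the opening scene of a TV show.")
history, second = chat_turn(history, "Now write the next scene.")
print(second)
print(video_generation("Darth Vader raises a red lightsaber."))
```

That zero in the video model's context is exactly the inconsistency we just saw with Darth Vader, and it's what the rest of the workflow works around.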
Put simply, the video models do not remember any details about the scenes they just generated. Even if I used the exact same prompt to describe Darth Vader again, the model would still generate a slightly different character, breaking the consistency across scenes. But don't worry, there is a way to overcome this. Let's take a look at these two skits I created from scratch. Here's the first one:

"Google Gemini, can you find that email from yesterday?"
"No, but I can show you an ad that looks like an email. You're welcome."

All right, and the second one:

"Hey Gemini, can you play that YouTube video from Jeff Su?"
"Absolutely. But first, please enjoy this unskippable 30-minute ad."

Now, was that perfect? No. Given more time, I could have made them much more polished. But the key is that the appearance and voice of the Gemini mascot stayed the same, or consistent, across scenes. And to achieve that level of consistency, we just need to follow four simple steps.

Step one: generate an image of our character. That's right, even though this video is all about AI video, our very first step is to use an image generation tool to create a static image of our character. Normally I use Midjourney for this, but since Midjourney is a paid app, we're going to use Google's free image generation tool, Whisk, for this tutorial. And at this point, I want to be very clear: the tools I mention in this video matter a lot less than the workflow and the underlying logic. Okay, back in Whisk, I'm just going to paste the prompt that will generate the Gemini mascot character. Don't worry, I'll share all the prompts I use in this video down below.
And under Settings, this is very important: I'm going to disable Precise Reference for now, because I want the AI to have more creative freedom. I'll send this, and let's fast-forward this next part. Okay, so the first two results were already great. I just ran it another time to show you that if you aren't happy with the first batch, you can generate a few more batches until you find something you like. For me, I'm going to go with this one right here. I like that it's bigger and it's a full frontal shot of the character, which might make the future steps easier.

Pro tip: if you have an image you mostly like, but you want to change one specific thing, you can do this. Simply click the Refine button, and under Settings, make sure Precise Reference is enabled. Then just describe the change you want to make, for example, "Change the color of the fur to white with pastel orange gradients," and click Generate. We'll fast-forward this next part. All right, this looks really good, right? The reason this works so well is that by enabling Precise Reference, we're telling Whisk to use Google's Nano Banana image generation model, which is fantastic at maintaining character consistency in still images. If you don't believe me, you can upload the original image into the Google Gemini web app with image editing enabled, or even Google's AI Studio, use the exact same prompt, and you'll see that only the fur color changes. The character stays the same. Yes, all three of these methods are free, and no, Google is not sponsoring this video.
Although I really wish they would. Maybe I'm just not PC enough. Anyways, once you're happy with the image, just click here to download it, and we're now ready for step two. By the way, I have a free AI toolkit that cuts through the noise and helps you master essential AI tools and workflows; I'll leave a link to that down below.

Step two: create the starting frame. All right, now that we have our main character, it's time to place him into a scene that we'll eventually turn into a video clip. Staying right here in Whisk, we're going to expand the sidebar, and we can either upload or simply drag the image from step one into this Character box. By doing this, we're basically telling Whisk: hey, see this character right here? I want you to include this exact character in the next scene we generate. Because that's the entire point, right? We want the mascot to have the same appearance in every single scene. After making sure the subject is selected here, we go back into Settings and make sure Precise Reference is enabled. Then we use this prompt, which again I'll share, to generate the still image of our starting scene, which depicts the mascot talking to a female worker in an office setting. Just like before, the first batch was fine (these two), but I ran it one more time, and I'm glad I did, because I actually like this one a bit more. The entire mascot is in frame, versus the first option, and in that other one the mouse is all messed up. So I'm just going to go ahead and download this image.
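The reference image handles the visual identity, but it also helps to keep the text description of your character word-for-word identical across scene prompts, so no wording drift sneaks in between generations. A toy sketch of that habit; the names and the character description are mine for illustration, not part of Whisk:

```python
# Sketch: keep one canonical character description and splice it,
# verbatim, into every scene prompt. Names here are illustrative.

CHARACTER_SHEET = (
    "a plush Gemini mascot with white fur and pastel orange gradients, "
    "large friendly eyes"
)

def build_scene_prompt(character_sheet, scene):
    # The character span is identical in every prompt; only the
    # scene description changes.
    return f"{character_sheet}, {scene}"

scene_1 = build_scene_prompt(
    CHARACTER_SHEET, "talking to a female worker in an office"
)
scene_2 = build_scene_prompt(
    CHARACTER_SHEET, "talking to a male co-worker at a standing desk"
)
print(scene_1)
print(scene_2)
```

Same idea as the reference image, applied to text: one source of truth for the character, reused everywhere.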
And this is going to be the first frame of our first video clip. Now, just to prove to you how critical some of these settings are, I'm going to deselect the mascot as the subject, turn off Precise Reference, and use the exact same prompt. I'll speed this up here. All right, these look pretty terrible, because as you can see, without a reference image, Whisk basically generated a mascot from scratch, and these two don't even look the same within the same batch.

All right, since we're creating two separate clips for our skit, we just need to repeat the exact same process to create the starting frame for our second video. I'll keep the mascot selected as the subject, make sure the Precise Reference feature is toggled on, and simply use a different prompt. This time we're going to have the Gemini mascot interact with a male co-worker. All right, perfect. I think I like this one the most, so I'm just going to download it. Now that we have our two starting frames, with our mascot looking perfectly consistent in both, we are finally ready to generate some video footage.

Step three: actually creating the videos. To do this, we're going to head on over to Google's Flow app. Quick heads up: I am using the paid version of Flow, so I have access to the Veo 3 Quality model, but I actually tested this with Veo 3 Fast, the model free users have access to, and it works the exact same way. First, I'll select the Frames to Video option and upload our first starting frame, the one with the female worker in it, then click Crop to save. This tells the AI we're giving it a still image that we want to turn into an animated video.
Once that's uploaded, I'm going to paste in this prompt, which tells Flow exactly how I want the scene to play out, from the dialogue to the action. Don't worry, we'll talk more about how to write effective text-to-video prompts in a little bit. Under Settings, I want this to be in landscape, that's fine, and I actually want four outputs per prompt. Yes, this eats up my credits a lot faster, but it gives me a much higher chance that at least one of the outputs will be usable. You'll see what I mean. And we'll hit Generate.

Okay, the videos are done. Let's go over an example of a bad output:

"Google Gemini, can you find that email from yesterday?"
"No, but I can show you an ad that looks like an email. You're welcome."

Okay, so obviously that one doesn't work, but because we had four outputs, at least one of them should be good. Luckily for me, all three of the others were fine, and I like this one the most. Let's play it back:

"Google Gemini, can you find that email from yesterday?"
"No, but I can show you an ad that looks like an email. You're welcome."

So I'm going to favorite this and then download it for the next step. Usually I'd go with Upscaled if I were uploading to, say, YouTube, but since we're just going over an example here, I'll choose the original size. Now we just rinse and repeat for our second scene: keep Frames to Video selected, upload and select the starting frame for our second clip, paste in this prompt, and hit Generate.
And we'll fast-forward to the next part. All right, the videos are done, and looking through this batch, I like this one the most:

"Hey Gemini, can you play that YouTube video from Jeff Su?"
"Absolutely. But first, please enjoy this unskippable 30-minute ad."

All right, I'm going to favorite it and download it in original size. Let's actually play both clips back to back:

"Google Gemini, can you find that email from yesterday?"
"No, but I can show you an ad that looks like an email. You're welcome."

"Hey Gemini, can you play that YouTube video from Jeff Su?"
"Absolutely. But first, please enjoy this unskippable 30-minute ad."

Again, definitely not perfect, but the important thing is that the Gemini mascot looks the same across both clips. But there's another issue: the voice of the Gemini mascot is completely different in the two scenes. Don't worry, we're going to fix that in the next step. But before we do, let me share my process for writing text-to-video prompts. In a nutshell, I created a Gemini Gem that takes user input and outputs an optimized video prompt. I've also uploaded video prompting best practices as knowledge files. After starting a new chat in the Gem, I first upload the starting frame image and a screenshot of the Flow app to give it additional context. Then I just describe the scene I want and give it a script, which I obviously had to come up with, and the Gemini Gem will write a detailed prompt optimized for Google's Veo model.
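To make that concrete, here's a rough sketch of the structure my Gem-written prompts end up with: separate sections for subject, action, dialogue, camera, and audio. The section layout is my own guess at good practice, not an official Veo prompt schema, and the helper name is mine:

```python
# Sketch of assembling a structured text-to-video prompt.
# The five-section layout is an assumption about good practice,
# not an official schema.

def build_video_prompt(subject, action, dialogue, camera, audio):
    lines = [
        f"Subject: {subject}",
        f"Action: {action}",
        f'Dialogue: "{dialogue}"',
        f"Camera: {camera}",
        f"Audio: {audio}",
    ]
    return "\n".join(lines)

prompt = build_video_prompt(
    subject="the plush Gemini mascot from the reference frame",
    action="turns to the female worker and shrugs apologetically",
    dialogue="No, but I can show you an ad that looks like an email. "
             "You're welcome.",
    camera="static medium shot, office interior, soft daylight",
    audio="quiet office ambience, faint keyboard clicks",
)
print(prompt)
```

Whether you use a Gem or write these by hand, the point is the same: spelling out subject, action, dialogue, camera, and audio separately gives the video model far less room to improvise.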
I'll link this below for you to try for free, but let me know in the comments if you want a full video on how to create powerful Gemini Gems and custom GPTs, because it's actually a lot of work to create something really good.

Step four. All right, our final step is to give our mascot a consistent voice across both scenes. For this, we're going to use a tool called ElevenLabs. Once you're signed in, navigate to the Voice Changer option on the left, and upload the video file for scene one that we downloaded from Flow. Then choose the voice you want to change your character's voice to; I decided on Malvorex, the monster voice, which sounds about right. Then I'm just going to click Generate Speech and wait a little bit. Okay, let's play this back:

"Google Gemini, can you find that email from yesterday?"
"No, but I can show you an ad that looks like an email. You're welcome."

Okay, you'll notice that the mascot's voice and the female professional's voice have both been changed, but that's okay; it's part of the plan. We can now download this new audio. Next, we do the exact same thing for scene two: upload the video, and, most importantly, select the exact same voice, the monster, of course, because the whole point is to keep the mascot's voice consistent. And we click Generate Speech. Okay, I played it back, and it sounds good. We'll download this to use in the next and final step.
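If you end up doing this for many scenes, the same voice-change step can in principle be scripted against ElevenLabs' API instead of the web UI. Fair warning: the endpoint path, header name, and model ID below are my assumptions based on ElevenLabs' public docs at the time of writing; verify them against the current docs before use. This sketch only builds the request, it doesn't send anything:

```python
# Sketch: building (not sending) an ElevenLabs speech-to-speech request.
# Endpoint path, header name, and model_id are ASSUMPTIONS; check the
# current ElevenLabs API docs before relying on them.

def build_voice_change_request(voice_id, audio_path, api_key):
    url = f"https://api.elevenlabs.io/v1/speech-to-speech/{voice_id}"
    headers = {"xi-api-key": api_key}
    files = {"audio": audio_path}                  # clip audio from Flow
    data = {"model_id": "eleven_english_sts_v2"}   # assumed model name
    return url, headers, files, data

# The consistency trick in one line: the SAME voice_id for every scene.
MONSTER_VOICE_ID = "voice-id-goes-here"  # placeholder; copy from your account
req_scene_1 = build_voice_change_request(MONSTER_VOICE_ID, "scene1.wav", "key")
req_scene_2 = build_voice_change_request(MONSTER_VOICE_ID, "scene2.wav", "key")
```

Identical endpoint and voice ID for both scenes is the programmatic equivalent of picking the same voice in the dropdown twice.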
Now, for the final bit of magic, we need to bring the original video clips (the ones with inconsistent audio from Flow) and the new audio files from ElevenLabs into a video editing tool like Final Cut Pro. First, we detach the original, inconsistent audio from both clips. Then we bring in the two new audio files we just generated with ElevenLabs. And here's a key step: I'm going to manually replace only the mascot's lines with the new, consistent monster voice. This way, the human actors keep their original voices, but our mascot now sounds exactly the same across both scenes. As a final touch to really sell the scene, we can layer in some subtle ambient office sound effects in the background.

"Google Gemini, can you find that email from yesterday?"
"No, but I can show you an ad that looks like an email. You're welcome."

"Hey Gemini, can you play that YouTube video from Jeff Su?"
"Absolutely. But first, please enjoy this unskippable 30-minute ad."

And with that, we've successfully created a multi-scene skit with an AI character that is both visually and audibly consistent. A few things I want to leave you with. First, it's totally possible to have two or more consistent characters across scenes: simply upload two or more subjects into Whisk, describe the scene, and use that as your starting frame. The principle is the same. Second, let's talk about third-party tools. There are capable AI video tools out there, like OpenArt, Hailuo, and Kling, that market themselves as all-in-one solutions.
These tools do make the video generation process easier, but to produce polished videos, there's still a ton of manual work involved, like generating the initial character and fixing the audio. Not to mention, those tools aren't exactly easy to use for the average person. So here's the bottom line: again, video models have gotten extremely powerful, but AI video tools are just that, tools. We need to learn what each tool is good for and build a workflow that combines the strengths of each one. Just think about what we did today. First, we used Whisk to generate our character. Then we used Whisk again to create the starting frame. Then we used a custom Gemini Gem to write a text-to-video prompt. We used Flow to actually generate the video. Then we used ElevenLabs to generate consistent audio. And after all of that, we still had to use a video editor to piece it all together.

All right, so about Sora 2. They announced two features. The first one is called Cameo, which uses a recording of your actual face and voice to keep your likeness consistent across scenes. The issue here is that Cameo only works with real people and pets, so it's very limited in the characters we can actually create. The second feature is called Recut, which lets you load the last few seconds of a clip into your next prompt to maintain continuity. If this works as intended, it's a big deal, but it's just one step in the workflow. We still need to generate the character, write robust video prompts, fix the audio, and so on. So yeah, these seem like awesome features, but they're just that: features that need to be integrated within a broader workflow.
Let me know if you want a full tutorial on Sora 2. See you in the next video, and in the meantime, have a great one.