The Future of AI Video Has Arrived! (Stable Diffusion Video Tutorial/Walkthrough)

Theoretically Media
28 Nov 2023 · 10:36

TLDR: The video introduces Stable Video Diffusion, a new AI video model from Stability AI that generates short video clips from still images. The model produces 25 frames at a resolution of 576x1024; a second fine-tuned version runs at 14 frames. The showcased output is high-fidelity, though upscaling and interpolation can improve it further. The video also discusses the model's understanding of 3D space, which yields coherent faces and characters. Users can run the model locally via Pinokio (currently Nvidia GPUs only) or online through Hugging Face and Replicate. Upcoming improvements include text-to-video, 3D mapping, and longer video outputs. Finally, the video highlights Final Frame, a tool for extending clips by merging AI-generated images with existing video content.


  • πŸš€ New AI video model from Stability AI generates short video clips from image conditioning.
  • πŸ–₯️ The model generates 25 frames at a resolution of 576x1024, with another fine-tuned variant running at 14 frames.
  • 🌟 Videos produced by the model have high fidelity and quality, with examples showing 2-3 seconds of impressive visuals.
  • πŸš— Outputs can be improved with upscaling and interpolation, with Topaz being used for comparison in the script.
  • πŸ“ˆ The model's performance is showcased in a side-by-side comparison with other image-to-video platforms, highlighting its motion and action capabilities.
  • πŸŽ₯ Lack of camera controls is a current limitation, but custom LoRAs are expected to add these functionalities soon.
  • πŸ“Š Controls for overall motion level are available, with different settings shown to affect the speed and dynamics of the video.
  • πŸ€– The model demonstrates an understanding of 3D space, which is crucial for coherent faces and character animations.
  • πŸ’» For local use, Pinokio is recommended for one-click installation, though it currently only supports Nvidia GPUs.
  • 🌐 Hugging Face and Replicate offer options to try the model online, with Replicate providing free initial generations and a pay-as-you-go model.
  • πŸ“ˆ Users can upscale and interpolate videos using tools like R Video Interpolation, enhancing the final output quality.
  • πŸ” Ongoing improvements to the model include text-to-video, 3D mapping, and longer video outputs to address current limitations.

Q & A

  • What is the name of the AI video model discussed in the video?

    -The AI video model discussed is Stable Video Diffusion.

  • What frame counts and resolution does the Stable Video Diffusion model currently support?

    -It generates 25 frames at a resolution of 576 by 1024. There is also a fine-tuned variant that runs at 14 frames.

  • What is the expected future feature for the Stable Video Diffusion model?

    -Text to video is an expected future feature that has not been released yet.

  • How long do the generated video clips from Stable Video Diffusion typically run?

    -The generated video clips typically run for about 2 to 3 seconds.
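The 2-3 second figure is just arithmetic on the frame budget: 25 frames played back in the high single digits of frames per second lands in that range. A one-line helper makes the relationship explicit (the fps values used below are assumptions; the video does not pin down an exact playback rate):

```python
def clip_seconds(frames: int, fps: int) -> float:
    """Playback length of a generated clip, in seconds."""
    return frames / fps

# 25 frames at 10 fps is a 2.5-second clip
print(clip_seconds(25, 10))  # 2.5
```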

  • What tool was used to upscale and interpolate the outputs from Stable Video Diffusion in the example provided?

    -Topaz was used to upscale and interpolate the outputs.

  • What is the significance of Stable Video Diffusion's understanding of 3D space?

    -Its understanding of 3D space allows for more coherent faces and characters in the generated videos, leading to more realistic and consistent environments across different shots.

  • What are some of the controls available for adjusting the output of Stable Video Diffusion?

    -Controls include the overall level of motion, aspect ratio selection, frames per second to adjust the output length, and motion bucket to control the amount of motion in the video.

  • How can one try out Stable Video Diffusion for free?

    -It can be tried for free on Hugging Face by uploading an image and generating a video.

  • What is the name of the tool that allows users to extend video clips generated by Stable Video Diffusion?

    -The tool is called Final Frame.

  • What is the main challenge when using Final Frame to merge video clips?

    -The main challenge is that as of the time of the video, the save project, open project, and new project features do not work, so users have to be careful not to lose their work if they close their browser.

  • What is the current status of camera controls in Stable Video Diffusion?

    -As of the time of the video, camera controls are not yet available, but they are expected to arrive soon via custom LoRAs.

  • What improvements are being made to the Stable Video Diffusion model?

    -Improvements in progress include text-to-video, 3D mapping, and longer video outputs.



πŸš€ Introduction to Stable Video Diffusion

This paragraph introduces the new Stable Video Diffusion model from Stability AI. It emphasizes that the model can generate short, high-quality video clips from still images, countering common assumptions about the complexity and hardware requirements of running such models. Text-to-video functionality is said to be coming soon. The model is trained to generate 25 frames at a resolution of 576x1024. An example clip showcases the fidelity and quality achievable, and the paragraph covers the effects of upscaling and interpolation with Topaz, along with a side-by-side comparison against other image-to-video platforms in terms of action and motion. The lack of camera controls is noted, but the speaker expects them to be added soon through custom LoRAs.


πŸ“Š Running Stable Video Diffusion on Different Platforms

This paragraph covers the ways to run the Stable Video Diffusion model: locally via Pinokio, for free on Hugging Face, or through the Replicate platform. The speaker gives step-by-step instructions for Pinokio, noting that it currently only supports Nvidia GPUs. The Hugging Face demo is free but can return errors when demand is high. Replicate is presented as a non-local option where users can run several generations for free before being asked to pay a reasonable fee. The speaker walks through the parameters adjustable on Replicate, such as frame count, aspect ratio, frames per second, motion, and conditioning augmentation. The paragraph also touches on video upscaling and interpolation using other tools like R-Video Interpolation.
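Replicate's hosted models can be driven over its standard HTTP predictions API. The sketch below is hedged: the endpoint URL is Replicate's documented one, but the model version placeholder and the exact input field names (`input_image`, `frames_per_second`, `motion_bucket_id`, `cond_aug`) are assumptions modeled on the parameters described above, so check the model's page on Replicate before relying on them.

```python
import json
import os
import urllib.request

# Replicate's documented predictions endpoint
API_URL = "https://api.replicate.com/v1/predictions"


def build_payload(image_url, version, fps=6, motion_bucket_id=127,
                  cond_aug=0.02):
    """Assemble the JSON body for a prediction request.

    Field names are illustrative assumptions based on the knobs the
    video describes (fps, motion bucket, conditioning augmentation).
    """
    return {
        "version": version,  # model version hash from the Replicate page
        "input": {
            "input_image": image_url,
            "frames_per_second": fps,
            "motion_bucket_id": motion_bucket_id,
            "cond_aug": cond_aug,
        },
    }


if __name__ == "__main__":
    # Requires a REPLICATE_API_TOKEN and a real model version hash.
    payload = build_payload("https://example.com/frame.png",
                            version="<version-hash>")
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Token {os.environ['REPLICATE_API_TOKEN']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["id"])  # prediction id to poll for the result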


πŸ” Extending Video Clips with Final Frame

The final paragraph discusses how to extend the short clips generated by Stable Video Diffusion using the Final Frame tool. The speaker explains that the creator of Final Frame, Benjamin Deer, has added an AI image-to-video tab where users can upload an image, process it, and then add more clips to build a longer, continuous video. The speaker demonstrates merging clips, rearranging them on the timeline, and exporting the final video, while noting that some features, such as saving and opening projects, are not yet functional. Viewers are encouraged to send feedback to help improve Final Frame, an indie project developed by a community member.



πŸ’‘AI video model

An AI video model refers to an artificial intelligence system designed to generate or manipulate video content. In the context of the video, it is used to create short video clips from images, a significant advancement in AI-driven content creation. The script notes that Stability AI's new video model, Stable Video Diffusion, has just been released and is capable of generating high-quality clips.

πŸ’‘Stable Video Diffusion

Stable Video Diffusion is the AI model discussed in the video, developed by Stability AI and built on its Stable Diffusion image model. It is trained to generate short video clips from a single image, a notable achievement in AI-generated media. The video emphasizes that despite potential misconceptions about complexity or hardware requirements, the model offers an approachable way to create dynamic video content.


πŸ’‘GPU

A GPU, or Graphics Processing Unit, is a specialized electronic circuit designed to accelerate the creation of images in a frame buffer intended for output to a display device. The script mentions that a powerful GPU is typically required to run AI video models, but the presenter reassures viewers that options exist even for those without such hardware.

πŸ’‘Image to video

Image to video is the process of turning a still image into a video, typically by synthesizing motion from the single source frame. The script notes that Stable Video Diffusion currently supports image-to-video generation, with text-to-video on the horizon. This capability is significant because it allows video content to be created from static images, expanding the possibilities for video production.


πŸ’‘Resolution

Resolution refers to the number of pixels displayed in an image or on a screen, with higher resolution meaning more detail. In the context of the video, the model is trained to generate videos at 576 by 1024 pixels, which determines the baseline quality of the generated clips.


πŸ’‘Topaz

Topaz is a software suite that provides image and video enhancement tools such as upscaling and interpolation. The script mentions that outputs from Stable Video Diffusion were upscaled and interpolated with Topaz, noticeably improving the resolution and detail of the generated videos.

πŸ’‘Motion control

Motion control in the context of the video refers to the ability to manipulate the level of movement or animation within the generated video clips. The script discusses different levels of motion control, such as motion 50, 180, and 255, which affect the dynamics and speed of the video, allowing for more creative and dynamic results.

πŸ’‘3D space understanding

Understanding of 3D space is the AI model's capability to interpret and generate content with a sense of depth and spatial awareness. The video script illustrates this with examples of more coherent faces and characters, suggesting that the AI can create more realistic and immersive video content by simulating a three-dimensional environment.


πŸ’‘Pinokio

Pinokio, in the context of the video, is a tool that allows Stable Video Diffusion to be run locally. It is mentioned as a one-click installation option for users with Nvidia GPUs, making it a convenient choice for generating videos without relying on cloud-based services.

πŸ’‘Hugging Face

Hugging Face is a company that hosts a platform for AI models, including a demo of Stable Video Diffusion. The script mentions it as a way to try the model for free, albeit with potential slowdowns or errors due to high demand. It represents an accessible entry point for experimenting with AI-generated video content.


πŸ’‘Replicate

Replicate is a platform that lets users run the Stable Video Diffusion model on a pay-as-you-go basis. It balances accessibility and cost with a reasonable pricing structure, making it an attractive option for generating AI videos without significant upfront investment.

πŸ’‘Final Frame

Final Frame is a tool discussed in the video that enables users to extend their AI-generated clips by merging them with other video content. It is a project developed by an individual creator and allows longer, more complex videos to be built by stringing together shorter clips generated by Stable Video Diffusion or other means.


A new AI video model from Stability AI, Stable Video Diffusion, has been released, offering a fantastic tool for creating short video clips from images.

The model is trained to generate 25 frames at a resolution of 576 by 1024, with another fine-tuned variant running at 14 frames.

Videos generated can run for around 2 to 3 seconds, showcasing stunning fidelity and quality.

Steve Mills' example video demonstrates the high quality of the AI-generated videos.

Topaz's upscaling and interpolation can significantly enhance the output, as shown in a side-by-side comparison.

Stable Video Diffusion's motion control allows for varying levels of speed and dynamics in the generated videos.

The model has a good understanding of 3D space, leading to more coherent faces and characters.

Kaai Zang's example illustrates the model's ability to create a 360-degree turnaround from a series of images.

Stability's example image shows consistent environmental rendering across separate shots.

Pinokio is a user-friendly option for running Stable Video Diffusion locally, with one-click installation.

Hugging Face offers a free demo of Stable Video Diffusion, though it may be slowed by high user traffic.

Replicate provides a platform to run Stable Video Diffusion generations with a reasonable pay-as-you-go model.

Users can adjust the frame rate, motion bucket, and conditional augmentation for customized video outputs on Replicate.

For video upscaling and interpolation, tools like R Video Interpolation and a video upscaler can enhance the final product.
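To make the upscaling-plus-interpolation step concrete, here is a rough sketch of what a 2x spatial upscale and 2x frame interpolation do to a 25-frame, 1024x576 clip. The assumption that the interpolator inserts exactly one in-between frame per adjacent pair is illustrative; specific tools may behave differently.

```python
def enhanced_output(width, height, frames, scale=2, interp_factor=2):
    """Resolution and frame count after upscaling plus frame interpolation.

    Assumes the interpolator synthesizes (interp_factor - 1) new frames
    between each adjacent pair of original frames.
    """
    return {
        "width": width * scale,
        "height": height * scale,
        "frames": frames + (frames - 1) * (interp_factor - 1),
    }

# A 25-frame 1024x576 clip becomes 49 frames at 2048x1152
print(enhanced_output(1024, 576, 25))
```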

Final Frame, an AI image to video tool, has been updated with new features and allows for merging video clips into one continuous file.

Final Frame's timeline feature enables users to rearrange clips for creative video arrangement.

Despite being a project by a single developer, Final Frame is a commendable tool for indie creators and community members.

The creator of Final Frame, Benjamin Deer, is open to suggestions and feedback for further improvements to the tool.

Upcoming improvements for Stable Video Diffusion include text-to-video, 3D mapping, and longer video outputs.