Belgrade, December 5, 16:00 (GMT +1)

Diffusion text-to-video generation methods

The last few years can be described as the heyday of generative models working across different modalities. This report is dedicated to one of the most challenging and computationally demanding tasks in this field, namely video synthesis from natural-language text (text-to-video), and to diffusion-based approaches to it.

We will discuss the theoretical aspects of the diffusion process, along with its advantages and disadvantages. We will also cover the new Kandinsky 3.0 architecture, the peculiarities of its training, the nuances of collecting, filtering, and storing training data, and the results obtained. We will then turn to generating videos from text descriptions, discussing the creation of different types of animations and the end-to-end generation of full-fledged videos from text. We will also cover the main difficulties in training and evaluating the quality of generative models, and introduce the new Kandinsky Video generation model. Finally, we will discuss applications of text-to-video models and their prospects.
