Building an AI Video Editor Prototype in 100 Days(ish)

The current iteration of Generative AI doesn't feel built for the benefit of artists.

Instead, the focus seems to be on maximizing shareholder value. Training the models used in Generative AI costs large amounts of capital, data, electricity, and water.

The models powering this generation of tools are trained on a giant corpus of art, but, with very few exceptions, the source artists aren't compensated at all.

But it doesn’t have to be this way. Generative AI can be used instead to empower artists and make art more valuable.

Artists and their Relationship with Disruptive Technology

When streaming rose to prominence, the creatives who had created shows and music were mostly left out of streaming deals. Dave Chappelle famously had his show streamed on Netflix without ever seeing compensation, and fought hard to convince executives to take down the show, on the principle of fairness. Taylor Swift also had to negotiate hard to establish fair royalties, pulling her catalog off Spotify completely until Spotify agreed to restructure rates paid to her.

Given this context, generative tools for creatives have to thread a very fine needle. If they are to be embraced by artists, they need to show that they can be used to enhance the value of art and grow the market for independent artists.

Capital tends to squeeze artists wherever possible. Any new technology will inevitably be used to put even more financial pressure on artists (who mostly lack the capital to defend themselves legally), and to maximize returns for existing pools of capital.

Storytelling, music, and the visual arts enrich our collective human experience. Poorly designed Generative AI has the potential to muddy our main collective commons (the internet) with bland, derivative generated content, created en masse for pennies.

Giving Artists a Fighting Chance with Generative AI

Given such a daunting set of challenges coming for artists, where do you begin? If Generative AI is going to disrupt creative processes, how do we ensure artists get a seat at the table?

I don’t have any answers.

But, I’m willing to explore.

I've found that when I want to learn about something, it helps to just start building a thing. And when it comes to building something new, it's best to start with the tiniest possible idea.

Rather than building a giant machine to empower artists, what if we used the existing models to enable new methods of creativity?

My experience is mostly with computer vision, so I started there. I’ve always been a fan of artists like cyriak, and would love to build a tool to make the world a bit more cyriakish.

As a start, I saw a style I was impressed by:

Facebook’s Segment Anything model would be a great tool for making this sort of effect easier.

The original artist made it in After Effects, with a lot of patience and manual masking of himself. Of course, Clay is an incredibly talented dancer, and made his own storyline to fit the effect too. But(!) manipulating the outlines of people and masking them is a chore.

A tool to help you creatively explore the segments of your video with better masking would be enough of a start.

So with that, I was off… (ish)

Building an Open Source AI Video Editor in… Python?

AI video editor

Given the progress of machine learning models, it seemed the shortest path to building a piece of software capable of helping an artist make a video like Clay's was to build as much of it as possible in Python. Python is a lot of things, but I've mostly used it for backend infrastructure development, not desktop applications.

So I started by looking for a toolkit to build a video editor in Python, to see whether or not it would even be possible.

It turns out, ModernGL along with ModernGL-Window make for a great way to get an OpenGL interface across Mac, Windows, and Linux. NVIDIA also has a Python library for hardware decoding and encoding of videos when using its video cards. Given the two, it seemed like enough to get started.

Of course, I'd need some way to interact with the videos I'm editing. For that, I used PyImgui to build a user interface on top of my OpenGL window.

Building a Prototype Interface with Imgui

ModernGL with Imgui

With moderngl-window, you subclass WindowConfig and override a few methods to define your window and render loop:

import moderngl_window
from moderngl_window.text.bitmapped import TextWriter2D


class App(moderngl_window.WindowConfig):
    title = "Text"
    aspect_ratio = None

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.writer = TextWriter2D()
        self.writer.text = "Hello ModernGL!"

    def render(self, time, frame_time):
        self.writer.draw((240, 380), size=120)


App.run()

In your __init__ (or the class attributes above it) you can set the resolution of your window, along with any other configuration you may need.

Moderngl-window also comes with an imgui integration. Adding it is easy enough:

import moderngl_window
import imgui
from moderngl_window.integrations.imgui import ModernglWindowRenderer
from moderngl_window.text.bitmapped import TextWriter2D


class App(moderngl_window.WindowConfig):
    title = "Text"
    aspect_ratio = None

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        imgui.create_context()
        self.imgui = ModernglWindowRenderer(self.wnd)
        self.writer = TextWriter2D()
        self.writer.text = "Hello ModernGL!"

    def render(self, time, frame_time):
        self.writer.draw((240, 380), size=120)
        imgui.new_frame()
        imgui.begin("Custom window")
        imgui.text("hello world")
        imgui.end()
        imgui.render()
        self.imgui.render(imgui.get_draw_data())

    def resize(self, width: int, height: int):
        self.imgui.resize(width, height)

    def key_event(self, key, action, modifiers):
        self.imgui.key_event(key, action, modifiers)

    def mouse_position_event(self, x, y, dx, dy):
        self.imgui.mouse_position_event(x, y, dx, dy)

    def mouse_drag_event(self, x, y, dx, dy):
        self.imgui.mouse_drag_event(x, y, dx, dy)

    def mouse_scroll_event(self, x_offset, y_offset):
        self.imgui.mouse_scroll_event(x_offset, y_offset)

    def mouse_press_event(self, x, y, button):
        self.imgui.mouse_press_event(x, y, button)

    def mouse_release_event(self, x: int, y: int, button: int):
        self.imgui.mouse_release_event(x, y, button)

    def unicode_char_entered(self, char):
        self.imgui.unicode_char_entered(char)

App.run()

We've added a lot of code, but it's mostly just passing events through to our imgui instance, so that when we click somewhere, imgui knows when and where we clicked.

Imgui uses immediate mode rendering, allowing you to define your UI directly within the render loop, which makes for quick iteration when building a prototype: you can build and test new ideas very rapidly.
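
For example, a first pass at a timeline scrubber is just a slider redrawn every frame. Here's a minimal sketch, assuming hypothetical current_frame and frame_count attributes on the app and a seek() helper (none of these are from the real editor):

# inside render(), between imgui.new_frame() and imgui.render()
imgui.begin("Timeline")
# slider_int returns (changed, value) on every frame in immediate mode
changed, self.current_frame = imgui.slider_int(
    "frame", self.current_frame, 0, self.frame_count - 1)
if changed:
    self.seek(self.current_frame)  # hypothetical seek helper
imgui.end()

Because the whole UI is re-declared on every frame, changing the layout is just a matter of editing this block and re-running.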

Loading and Scrubbing Video in OpenGL

Architecture of Video loading with NVIDIA VideoProcessingFramework

Of course, any creative tool is really only useful if it feels real time.

To do this effectively, I used the hardware accelerated decoder built into most recent NVIDIA graphics cards.

This lets us use a dedicated chip on the GPU to decode each video frame, so we can seek and play back with lower latency, usually without having to convert the loaded files first.

On NVIDIA hardware, you can just use the great VideoProcessingFramework. It even comes with example code, showcasing how to decode a video from the input colorspace to either a PyTorch tensor or an OpenGL texture:

# to seek to frame
src_surface = nvDec.DecodeSingleSurface(nvc.SeekContext(frame_no))

# convert to rgb color space using pipeline
rgb_pln = to_rgb.run(src_surface)

# convert to pytorch tensor
src_tensor = surface_to_tensor(rgb_pln)

# push to CPU, as a numpy array
b = src_tensor.cpu().numpy()

This gives us something we can then manipulate in Pillow or PyTorch / Numpy while prototyping, and save out as frames.
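
Saving one of those frames out is then only a couple of lines. A minimal sketch, assuming the decoded frame has already been reshaped into an H x W x 3 uint8 RGB array:

import os

from PIL import Image

os.makedirs("frames", exist_ok=True)
# b: the decoded frame as an H x W x 3 uint8 RGB numpy array
Image.fromarray(b).save(f"frames/frame_{frame_no:05d}.png")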

With this, we can then hook up Segment Anything and use imgui to pick the points we want to segment inside the OpenGL frame, along the lines of the sketch below. But before we could do that on a laptop, there was a catch: ModernGL-Window supports macOS, but Macs don't ship with any NVIDIA hardware at all.
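
Here's that point-prompting sketch, using the segment-anything SamPredictor API. The frame_rgb, x, and y names are stand-ins for the decoded frame and the click coordinates coming out of imgui:

import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# load the ViT-H checkpoint released with Segment Anything
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# frame_rgb: the decoded frame as an H x W x 3 uint8 RGB array
predictor.set_image(frame_rgb)

# a single foreground click, translated from an imgui mouse event
points = np.array([[x, y]])
labels = np.array([1])  # 1 = foreground, 0 = background
masks, scores, _ = predictor.predict(
    point_coords=points, point_labels=labels, multimask_output=True)
best_mask = masks[np.argmax(scores)]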

Building Hardware Video Decoding and Stable Diffusion XL on macOS

I looked for a while, but couldn't find a way to do hardware video decoding from Python on macOS, despite there being hardware support for it built into the M1 and M2 series processors.

So I decided to try to write a Python hardware playback library for macOS. Apple ships Objective-C frameworks for hardware decoding, but unfortunately hasn't released a Python interface to these APIs.

To do this, I used Cython, and borrowed from openFrameworks' implementation of its video player. This let me bridge the Objective-C code through C++, and then into a Python API that could play back the video.

After a lot of trial and error, I eventually had a library I could import and run on my macOS machine, with a reasonable API that mostly matched NVIDIA's VideoProcessingFramework:

import time

import numpy as np
import videoplayback

player = videoplayback.AVFPlayer()
player.load("filename")

# unfortunately haven't made loading a file synchronous yet
time.sleep(.3)

numFrames = player.length_in_frames()
w = player.width()
h = player.height()

destination_frame = 1
# get a frame
player.seek(destination_frame)
image = np.asarray(player.imageframe())
image = image.view(np.uint8).reshape(image.shape + (-1,))
# shape (1080, 1920, 4): drop the alpha channel
image = image[:, :, 0:3]
# swap the first and third channels (the decoder doesn't hand back RGB order)
image[:, :, [0, 2]] = image[:, :, [2, 0]]
# shape (1080, 1920, 3)
image = image.copy(order='C')

As for Stable Diffusion XL, Apple has released a repository with some optimizations, making it possible to generate images using hardware acceleration on M1 and M2 hardware.

Unfortunately, even with these optimizations, generating a single image takes around 2 minutes on my M1 MacBook Pro with 64GB of memory. Not a great feedback loop for creatives. For reference, generating a Stable Diffusion XL image on my desktop computer with a 4090 takes a few seconds.

However, using a service like Modal, I was able to get inference down to a few seconds by using a serverless GPU instance.
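
Roughly, that means defining the generation function and letting Modal run it on a rented GPU. This is a sketch rather than necessarily what I ran (Modal's API has shifted across versions, and the app, function, and GPU names here are illustrative):

import io

import modal

stub = modal.Stub("sdxl-assets")
image = modal.Image.debian_slim().pip_install(
    "diffusers", "transformers", "accelerate", "safetensors", "torch")


@stub.function(gpu="A10G", image=image)
def generate(prompt: str) -> bytes:
    # heavy imports happen inside the container, where the GPU lives
    import torch
    from diffusers import AutoPipelineForText2Image

    pipe = AutoPipelineForText2Image.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
    ).to("cuda")
    img = pipe(prompt=prompt).images[0]
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()


@stub.local_entrypoint()
def main():
    png = generate.remote("a cartoon UFO, studio lighting")
    with open("ufo.png", "wb") as f:
        f.write(png)

In practice you'd want to keep the pipeline loaded between calls rather than reloading it on every request, which is where most of the remaining latency in a sketch like this would go.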

Getting Early Feedback

When I showed a version of the editor prototype to a friend, he was excited about the ability to segment objects out of a video as they move.

So I built out the tools to do this, using the De-AOT model. It can predict masks for up to around the next 20 frames of a video at a time, and mostly works as well as you could ask.

As I built this out, I started seeing some potential uses of Stable Diffusion that could be a tool to empower creatives, rather than imitating their work whole cloth.

Building a Model Pipeline for Asset Generation

Stable Diffusion in Action

Stable Diffusion Asset Pipeline

Normally, diffusion models aren't built to live within an existing frame or context. Instead, they're built to create an image whole cloth from a text prompt, iteratively imagining what a picture might look like starting from noise.

In order to generate a transparent asset for a video (like a UFO or an arrow, or…), we’d need to be able to isolate and segment what we want out of an image, hopefully automatically.

By chaining a grounding model and Segment Anything after the diffusion step, we can generate an asset and automatically cut it out:

def generate_segmented_diffusion(object, prompt, negative_prompt="", auto=True, seed=None):
    # optionally seed the generation for reproducibility
    generator = None
    if seed is not None:
        generator = torch.Generator(device="cuda").manual_seed(seed)

    pipeline_text2image = AutoPipelineForText2Image.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16",
        use_safetensors=True, cache_dir="/app/.cache"
        ).to("cuda")

    img = pipeline_text2image(
        prompt=prompt, negative_prompt=negative_prompt, generator=generator).images[0]

    # ground the named object in the generated image, then segment it with SAM
    model = get_dino_model()
    img_ground = transform_pil_image_for_grounding(img)
    b, p = get_grounding_output(model=model, image=img_ground, caption=object,
                                box_threshold=.35, text_threshold=.25)
    boxes = b * 1024  # grounding boxes are normalized; SDXL images are 1024 x 1024
    boxes = box_convert(boxes=boxes, in_fmt="cxcywh", out_fmt="xyxy").numpy()
    predictor.set_image(np.asarray(img))
    masks, _, _ = predictor.predict(box=boxes[0])  # just take the first box
    masks = np.where(masks, 255, 0)
    masks = masks.copy(order='C')

    if auto:  # paste the first mask's cutout onto a transparent canvas
        img_cutout = Image.new('RGBA', (1024, 1024), color=(0, 0, 0, 0))
        mask = Image.fromarray(masks[0].astype('uint8'))
        img_cutout.paste(img, (0, 0), mask=mask)
        img_cutout = img_cutout.transpose(Image.Transpose.FLIP_TOP_BOTTOM)
        return img_cutout
    else:  # return the full image along with its masks
        return img, masks

Rather than creating a final image whole cloth, we can now explore our existing artistic vision with this. (Sidenote: I mentioned at the beginning my skepticism about Stable Diffusion and its training data set. I think (I might be wrong!) that incorporating it into a feedback loop like this is a better approach than just generating things whole cloth. I'm still not sure here.)

Using a ControlNet to Collaborate with your Image

ControlNets are a way to steer and direct the diffusion process as it occurs. You can train a model to steer the output using your own conditioning input, giving you a bit more control over how your diffusion model generates.

For example, you can use a model called OpenPose to show how the people in your images should be posed. With a ControlNet you can dictate exactly how many people should be in your image, and how they should be oriented.

Alternatively, if you've got an object you'd like to transform via diffusion, you can use something like Canny edge detection to get an outline of your object and use that outline to steer the generation.

We can use these models to change assets within our video, while retaining their original proportions. In the video above I’ve turned an orange barrier into a concrete barrier.
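
A rough sketch of that kind of swap, using diffusers' ControlNet pipeline (the model IDs, thresholds, and prompt here are illustrative, not necessarily what I ran):

import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

# Canny edges of the original frame become the conditioning image
frame = np.array(Image.open("frame_0001.png").convert("RGB"))
gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
edges = cv2.Canny(gray, 100, 200)
control = Image.fromarray(np.stack([edges] * 3, axis=-1))

result = pipe("a concrete barrier on a city street",
              image=control, num_inference_steps=30).images[0]
result.save("frame_0001_concrete.png")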

But if you look at it closely, there are still a few issues. Ideally I'd have a virtual 3D asset I could move with the camera's movement.

Using a Diffusion Network for Video with TokenFlow

But using a ControlNet on its own doesn't work very well for video. If you look at the video on the right, you'll see constant flickering between frames, as the diffusion model guesses a different outcome based on the input noise. Even if you keep the initial seed image consistent across frames, the way diffusion models work means you'll get flicker.

There are techniques to reduce the flicker; a commonly used one is AnimateDiff, which adds a temporal layer in the middle of the diffuser. But these techniques aren't perfect.

The best implementation of temporal stability I've seen so far is TokenFlow, which uses a different method, called Plug-and-Play, to steer the generation of an image while retaining its spatial meaning. You can see a snippet of the above video with a prompt of "a man surfing a wave".

Plug-and-Play takes an initial image, inverts it to noise via DDIM, and then runs it through the diffusion process, allowing the features relevant to generating the image to be extracted. These features are then injected into the self-attention layers of the diffusion model, while using the same inverted noise from the original image.

By combining this technique with frame-to-frame correlation, you can get a temporally consistent video, relatively free of flicker, as shown above. Still, both of these techniques look very artificial, and carry the stigma associated with "AI video": that telltale artificially generated look.

Building a Model Exporter for ebsynth

Stable Diffusion ControlNet for ebsynth

Ebsynth is a non-deep learning method for replacing the textures of your video with another art style. It works by taking an input series of video frames, along with some keyframes that have been painted to match the style you'd like applied to your video.

We can use ControlNet along with Stable Diffusion to take our video, and explore different text based ideas for textures.

In the example here, I used "red hot lava volcano fire flames" as my prompt, as I wanted to give myself a glowing effect.

For the ControlNet, I detect whether or not there are people present in the image, and if so, add a weighted ControlNet for OpenPose, in addition to the Canny Image ControlNet.
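
In diffusers terms, that amounts to passing a list of ControlNets along with per-net conditioning scales. A sketch, where person_detected, canny_image, and openpose_image are placeholders for whatever detector and preprocessors you wire up:

import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

canny_net = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pose_net = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)

controlnets = [canny_net]
control_images = [canny_image]      # placeholder: Canny edges of the frame
scales = [1.0]
if person_detected:                 # placeholder: any person detector
    controlnets.append(pose_net)
    control_images.append(openpose_image)  # placeholder: OpenPose skeleton image
    scales.append(0.8)              # weight the pose guidance a bit lower

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnets,
    torch_dtype=torch.float16).to("cuda")

styled = pipe("red hot lava volcano fire flames",
              image=control_images,
              controlnet_conditioning_scale=scales).images[0]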

Given all the other tools we've already built, adding ebsynth generation is straightforward enough. We need to select a subset of frames and then run Stable Diffusion on them, ensuring they all come from the same pathway generated by the diffusion model. I've also added a mask mode, allowing us to isolate a specific person for texture creation rather than applying the style to the whole video.

Architecture Diagram of Ebsynth and Stable Diffusion
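
Sketched out, the export step just walks the video at a fixed interval and styles each keyframe with the same seed and prompt, which is one way to keep every key on the same diffusion pathway. The pipe, prepare_controls, and num_frames names below are placeholders, and the interval and folder layout are illustrative:

import torch

KEY_INTERVAL = 20   # paint a keyframe every 20 frames (illustrative)
SEED = 1234

for frame_no in range(0, num_frames, KEY_INTERVAL):
    # placeholder helper: ControlNet conditioning images for this frame
    controls = prepare_controls(frame_no)
    generator = torch.Generator(device="cuda").manual_seed(SEED)
    styled = pipe("red hot lava volcano fire flames",
                  image=controls,
                  generator=generator).images[0]
    styled.save(f"keys/{frame_no:05d}.png")

Ebsynth is then pointed at the full frame sequence plus this keys folder.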

Again, this allows us to generate special effects while ensuring a single effect is applied to a single subject in our video.

With this, we can put our generated versions of our characters back into our videos.

Building AI for Artists

After having spent a few months using computer vision models and generative AI to build an editor, what have I learned?

Video is inherently multi-modal. You have a sequence of images, audio, and a narrative. Some machine learning models are now capable of working with multi-modal inputs. But the day to day work of video creation doesn’t fit well into a single model just yet. The most interesting results I’ve gotten have come from a mixture of traditional video editing, along with an orchestration of models.

But these tools don’t replace creativity! Creativity is mostly just continuing to show up for work, every day. Some days you show up and the results seem to come easily, and others it feels impossible to do anything.

We're still so early in the development of Generative AI that I can't really tell what the future will look like. I can see the tools starting to take shape, but it's not clear yet what will win, or how things will work.

I have hope that we’ll be able to beat the massive pools of capital seeking to replace creatives, and that the creatives will win.

If you want to follow along as I keep exploring, I encourage you to share this article, and sign up below for early access to the video editor.


I’d also love to hear from you if you have any ideas, please reach out via Twitter.
