An Unexpected Year Spent in AI

Github contributions for the past year

Last year I was let go after just 6 months in a new role.

I had left a great company and boss to take a chance on a startup, and before I’d even begun, it was over.

I decided to treat the event as an opportunity to explore what was now becoming possible in video with LLMs, diffusion models, and the growing number of other open models.

See, years ago I’d helped build a generative video editor that became a unicorn, and had ideas from then I’d wanted to see built.

These ideas were mostly unreasonable back in 2015, but given LLM and computer vision model progress, were now becoming possible.

The GPU Crunch and Local First, Multi-Modal Generative AI

Local Editor

So I initially focused on building a local video editor improved with multi-modal artificial intelligence. It used computer vision to detect, extract, and track objects in video, combined with Diffusion models to add and animate new objects into videos.

Five years earlier, I’d done daily video sketches using Mask-RCNN, experimenting with skateboard videos:

These video sketches let me explore the medium of AI-assisted video editing without any strong expectations.

I assumed building a tool to continue exploring this work would prove fruitful:

The Editor at Work

And indeed it did! I was soon playing with video as a more fluid medium, one that felt a bit more editable. I began to understand how the new vision models worked, and how the GPU could be used to speed up rendering, inference, and video processing.

By using a combination of models, I was able to create a prompt for adding unique, diffusion-generated objects into videos, already masked off.

You can read about that process in a previous blog post.

A Sidequest for Safety (Not the AI Kind)

Bicyclist Safety

But as I was building out computer vision pipelines and prototypes for the video editor, I experienced a string of tragic local deaths.

Bicyclists and pedestrians kept getting hit by cars.

So as a side project, I started researching cyclist safety, and soon discovered just how terrible pedestrian infrastructure is in the United States. I wondered if there wasn’t a technical solution to reduce or eliminate these deaths, as the statistics showed they were rapidly increasing.

So on a whim I put together a proposal to address this using artificial intelligence and robotics, and submitted it to the NSF’s SBIR program.

To my surprise, they invited me to submit a formal, Phase I proposal.

If accepted, the proposal meant I could receive up to $2 million to pursue and develop my technology, without the government taking any equity.

So of course, I did that.

It took a few months’ worth of work, and brought me far out of my comfort zone.

But! One of the conditions of submitting my proposal was that I had to pause all of my open source work related to the project, as the government couldn’t give me a grant for work already done.

This frustrated me, as pedestrians continued to die in my town. I felt guilty for each additional person who got hit.

As a consolation, since I’d already done the tedious paperwork to form and qualify a company to accept government grants like the SBIR (via SAM.gov), most of the labor to create another proposal was already taken care of.

So I also submitted an SBIR proposal to the Department of Transportation for Complete Streets AI.

This proposal imagined using smartphones and computer vision to help fill in gaps of pedestrian infrastructure knowledge at the DOT.

Six-plus months later, I found out the final answer from both the NSF and the DOT:

Being Rated by Anonymous Reviewers Sucks.

“No.”

(On a positive note, this means I’m once again able to be public about this work, and solicit help.)

Taking a Step Back From the Obvious

And after six months of working on the local video editor, I also hit a wall.

If you Stare into a Screen Long Enough, You can See the Future

It became apparent that AI as a layer “on top of” existing video editor workflows didn’t make much sense, given how different ML workflows incorporating large language models had become, and how much engineering had already gone into everything else around modern, flagship desktop video editors.

More powerful vision and audio models could of course be used to add features that reduce toil in existing workflows. But the assumptions underlying the user interfaces of today’s video editors seemed to be constraining the discovery of new methods for video creation, and, more importantly, the evolution of video as a medium.

It seemed the process of video creation itself had to be rethought, using the power and possibility of LLMs, multi-modal embeddings and search, and computer vision and diffusion models as collaborators.

Which led me to a thought:

What if video was more personal? More malleable? More collaborative?

This meant rethinking current video editing workflows according to the strengths of these models.

Building A Generative Video Platform

So I went back to the drawing board, and rethought the whole concept of a video editor. When I did, a thought occurred to me:

What if the output of the editing process wasn’t a single video, but a video generator?

What if, instead of a single, static artifact at the end of editing, we had a generator capable of rendering each video on demand, tailored to the specific needs of each viewer?

Generative Video Pipeline

What if we allowed for the user to collaborate in the experience of their video?

I imagined videos that would no longer be static outputs, but would instead be like code: dynamic, and generated specifically for a viewer or audience. Video would become more of a medium for play and interaction, rather than the typical, passive consumption model.

Creating a Dynamic Video Generation Pipeline with Promptflow

With that, I started building a new prototype of generative video using Microsoft’s LLM framework Promptflow. It allows mixing calls to an LLM with code, building up whole graph-based pipelines for generative AI workflows.

Promptflow DAG

You define these workflows via a YAML file, in which you declare the variables you’d like passed into the prompts for your ChatGPT tool. The results from these prompts can then be passed back to Python, or used to generate more LLM calls.
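As an illustration, here’s a minimal sketch of the kind of Python tool node a flow like this wires together. It assumes the `@tool` decorator import path used in Microsoft’s Promptflow examples; the function and file names are just placeholders.

```python
# A hypothetical Promptflow Python tool node. The flow's YAML file
# (flow.dag.yaml in the examples) declares the inputs below and routes
# this node's output into an LLM prompt node.
from promptflow import tool


@tool
def build_reading_prompt(sign: str, date: str) -> str:
    # The returned string becomes an input to the LLM node defined in the YAML.
    return f"Write a short, upbeat horoscope reading for {sign} on {date}."
```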

With this tool as a basis, I built an initial Horoscope video generator, using the most basic approaches to video generation. It took a prompt, injected the user’s variables, called out to an LLM to generate a video script, then generated images, transformed them into videos, added a voice narrator and subtitles, and finally put together a video edit.

The flow looks something like this:

Early Promptflow Prototype

The Video Generator takes in an Astrological Sign, a Date, and a Random Seed. (LLMs are known to have difficulty generating random numbers, so the seed is supplied as an input.)

These are all used to run a pipeline that generates a unique, on-demand video horoscope reading for the user.
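To make the shape of that pipeline concrete, here’s a rough sketch in plain Python; the helper functions are hypothetical placeholders standing in for the LLM, diffusion, TTS, and editing steps, not the actual generator code.

```python
import random


# Hypothetical stand-ins for the real pipeline steps (LLM, diffusion, TTS, editing).
def llm_generate_script(sign: str, date: str, rand_hint: float) -> str: ...
def generate_images(script: str) -> list: ...
def animate_images(images: list) -> list: ...
def synthesize_narration(script: str) -> str: ...
def make_subtitles(script: str) -> str: ...
def assemble_edit(clips: list, narration: str, subtitles: str) -> str: ...


def generate_horoscope_video(sign: str, date: str, seed: int) -> str:
    # The random seed is supplied by the caller, since LLMs are unreliable
    # at producing their own randomness.
    rng = random.Random(seed)
    script = llm_generate_script(sign, date, rng.random())
    images = generate_images(script)
    clips = animate_images(images)
    narration = synthesize_narration(script)
    subtitles = make_subtitles(script)
    return assemble_edit(clips, narration, subtitles)  # path to the rendered video
```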

From this initial prototype, I immediately ran into a few limitations.

The Promptflow design expected users to build wrappers around a service like ChatGPT, with mostly static flows. Think of things like customer service bots, using RAG to fill in the dynamic information necessary to answer a query.

This didn’t match the level of dynamic video generation and editing processes I envisioned getting to. The design and writing of these static flows didn’t feel like the right layer of abstraction.

So I switched my Promptflow over to a different workflow execution engine, Temporal.

Generative Workflows with Temporal

Temporal allowed me to restructure the generative processes I was building, with durable execution as a primitive.

Workflow of Video Analysis

Rather than building static graphs of execution, I could build out individual tools and processes, and later allow the user to decide how to link and execute them together for their specific process. (In Temporal, these are called “Workflows”, and come with automatic retries and more.)

With these as a base, writing ML generation workflows becomes a bit more straightforward. I could define Activities as discrete units with retries, and synchronize the execution of these Activities via Workflows.

To give an example, let’s say we want to analyze a video file that has been uploaded. (We want to know what’s in the video, as well as generate embeddings, text, and tags.)

There are many places where a failure could occur. A web request may fail to download a file, a file may have an unsupported encoding, or a GPU machine may not be schedulable. Each of these events would normally require logic to handle failure, have a set number of retries, and decide how to gracefully fail.

With Temporal, we instead set our number of retries and failure conditions across each activity. The Temporal platform automatically handles retries when things go wrong:

Workflow automatically retrying 10 times
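For a sense of what this looks like in code, here’s a minimal sketch using Temporal’s Python SDK (temporalio); the activity bodies, timeouts, and the 10-attempt retry policy are illustrative stand-ins for my actual pipeline.

```python
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy


@activity.defn
async def download_video(url: str) -> str:
    # Fetch the uploaded file and return a local path.
    # A flaky network here just triggers another retry.
    ...


@activity.defn
async def analyze_video(path: str) -> dict:
    # Run vision models on a GPU worker; return tags, text, and embeddings.
    ...


@workflow.defn
class VideoAnalysisWorkflow:
    @workflow.run
    async def run(self, url: str) -> dict:
        retries = RetryPolicy(maximum_attempts=10)
        path = await workflow.execute_activity(
            download_video,
            url,
            start_to_close_timeout=timedelta(minutes=5),
            retry_policy=retries,
        )
        return await workflow.execute_activity(
            analyze_video,
            path,
            start_to_close_timeout=timedelta(minutes=30),
            retry_policy=retries,
        )
```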

This is especially useful during development, where if an error occurs, I can usually spot and fix it, and then restart the worker.

The workflow then retries from the last successful execution, and usually completes.

This substantially speeds up the development flow for my graph execution workflows.

The Challenges of Building with (sometimes unpredictable) LLMs

Anthropic Workbench

Creating language model prompts for software workflows is a fuzzy process.

Designing a prompt requires a significant time investment to understand how your chosen model performs for your specific use case. And even once you’ve decided on a prompt, each model seems to have its own quirks about how it interprets and decides whether and how to follow instructions.

This means one prompt may end up being more appropriate for one specific model than another.

It’s tough to tell ahead of time if one model may be more appropriate for your task than another.

People try to address this by writing evals.

Evals are tests to see whether or not an LLM comes up with an appropriate answer. You can write evals in code, or ask an LLM to judge whether its answers are correct.
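The code-based kind can be as simple as a list of cases with checks you can run against the output. This is a toy sketch, with a hypothetical `generate` callable standing in for whatever model call you’re evaluating.

```python
# A toy, code-based eval harness. `generate` is any function that takes a
# prompt string and returns the model's text output.
def run_evals(generate, cases):
    failures = []
    for case in cases:
        output = generate(case["prompt"])
        if not all(term.lower() in output.lower() for term in case["must_include"]):
            failures.append((case["prompt"], output))
    return failures


cases = [
    {"prompt": "Write a one-line horoscope for Leo.", "must_include": ["Leo"]},
    {"prompt": "List three video edit styles on one comma-separated line.", "must_include": [","]},
]
# failures = run_evals(my_model_call, cases)  # an empty list means every check passed
```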

To help with this, Anthropic now has what it calls a “Workbench”, from which you can use Claude to generate and analyze specific prompts:

Analysis for your test set in prompts

Using Workbench allows you to get a feel for how you can approach your chosen prompt task, and what sort of outputs you can expect while developing.

You can quickly evaluate the performance of these generated prompts against one another within the user interface.

Thanks to the example prompts from Workbench, I added a process for generating prompts into my Video Generator, using Anthropic’s metaprompting example from Github.

These metaprompts are a great starting point for helping users get started with a prompt template to build from.

Amazingly, asking an advanced model to generate its own prompts seems to mostly work.
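In practice this is just another model call: feed the metaprompt template plus a task description to Claude, and keep the returned prompt as a starting point. Here’s a rough sketch with the Anthropic Python SDK; the model name and task text are illustrative assumptions, and the metaprompt itself comes from Anthropic’s published example.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Paste the metaprompt template from Anthropic's GitHub example here.
METAPROMPT = "..."

task = "Turn a list of clip descriptions and a viewer profile into a video edit plan."

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # illustrative model name
    max_tokens=2048,
    messages=[{"role": "user", "content": f"{METAPROMPT}\n\nTask: {task}"}],
)
generated_prompt = response.content[0].text  # a prompt template to review and edit
```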

Embeddings Might Not Be the Solution You Think They Are

Embeddings Model from What Are Embeddings?

Vicki Boykis has written an amazing, free book on building embeddings models.

Prior to reading it, I assumed vector databases would dominate search and retrieval for anyone working with LLMs. The hype sold in 2023 was that vector databases would be the future of information retrieval. I now think the added complexity of building your own embeddings model is underappreciated in that pitch.

But as I began working with embeddings and vector databases, the results didn’t seem to match the hype.

As I dug in, I discovered this is because embeddings are fundamentally a compression technology, squashing the unique features of your dataset into a fixed-length vector output across the embedding space.

How well these dimensions map to the data related to your business use case depends on how well you’ve built your embedding space.
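You can see the compression directly: with an off-the-shelf model from sentence-transformers (all-MiniLM-L6-v2 here, just as an example), every input string comes out as the same fixed-length vector, whatever it meant in your domain.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "skateboarder ollies over a handrail",
    "cyclist riding through an intersection at dusk",
]
vectors = model.encode(texts)

# Everything is squashed into 384 dimensions, regardless of what the text was.
print(vectors.shape)  # (2, 384)
```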

But! Most people getting started aren’t training their own embeddings models for their specific use case, and are instead relying on off-the-shelf, generic embeddings models to apply to their business problems.

This blind application of generalized embeddings models over traditional search can lead to worse results, and systems that are harder to debug.

Let’s take a concrete example.

A Song Search Example

Years ago I built a search engine for songs.

Search Example

One of the challenges I faced was bootstrapping relevant results for very generic search terms.

See, it turns out most song titles aren’t very unique, and so naive text search is terrible on song names, albums, and artists.

Because of this, a generic text embeddings model would be especially challenged to give decent results.

Say our user is searching for the term “stop”:

There may be tens of thousands of songs, albums, and artists with the word “stop” in them.

How do you begin to determine which ones should be most relevant?

To solve the problem, I turned to music top charts. These have been published since the 1940s, and contain some of the most culturally important songs.

By adding a weight or bias score to the songs previously in the top charts, I could help bootstrap an initial search system.
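The scoring itself can stay dead simple. Here’s a sketch of the idea, with made-up field names and weights, blending a plain text-match score with a prior for how long a track sat on the published charts.

```python
def score(song: dict, text_match: float, chart_weight: float = 0.3) -> float:
    # "weeks_on_chart" is a hypothetical field: total weeks the track spent
    # on any published top chart. Cap the prior so one mega-hit can't dominate.
    chart_prior = min(song.get("weeks_on_chart", 0) / 52, 1.0)
    return (1 - chart_weight) * text_match + chart_weight * chart_prior


candidates = [
    {"title": "Stop", "artist": "A chart-topping group", "weeks_on_chart": 20},
    {"title": "Stop", "artist": "An obscure garage band", "weeks_on_chart": 0},
]

# Both titles match the query "stop" equally well; the chart prior breaks the tie.
ranked = sorted(candidates, key=lambda s: score(s, text_match=1.0), reverse=True)
```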

If I had instead started with an embeddings model, I’m not sure I would have as easily built a solution. Maybe an off the shelf embedding model already has partial knowledge of the top charts, but how much?

Similarly, in building an automatic video editor, I’ve discovered it’s necessary to have a mix of embeddings models, traditional search ideas, and a bit of domain-specific experimentation.

The Wonderful, Totally Great Process of Building Something New

Showing up

Any time I see someone finish a thing, I try to go out of my way to congratulate them.

Getting anything out the door always includes an unseen number of challenges, and entropy works against all of us.

So of course, as I’ve worked towards a vision of building something new the past year, I’ve taken a few detours.

There is a great quote from Jensen Huang: when asked whether he’d start NVIDIA again, he says he wouldn’t, because he now knows how difficult it is:

Similarly, over the past year I’ve wondered if I’ve been too selfish, too naive in attempting to build something new on my own, rather than building off my existing success and luck and playing it safe with a full-time job.

I don’t know the answer yet, but I am grateful for the chance to find out.
