10 Things I Didn’t Expect Before Building Generative AI for Six Months

Six months ago, I started working on a Generative AI video editor. I began with the assumption that the newest machine learning models would unlock a new sort of software that wasn’t previously possible.

Of course, I didn’t know what that new software would look like, so I decided to start building with the smallest idea I had. Since then, the number of models applicable to video, audio, and text has exploded. It seems every week there is a new, more efficient model I should rewrite my application for.

Despite 15 years of experience developing backend applications, the past six months of working in the AI space have shown me that it’s fundamentally different from the rise of cloud and mobile. There are still many open questions about what the next generation of applications built using these models will look like, and how they’ll be deployed.

But I want to share some of the things that have surprised me so far about building AI applications.

Morality is a Top Business and Product Concern When Building AI Products

Given I’m working on a video editor, there are many moral and ethical questions which need to be addressed when I’m choosing what to build.

Models can now clone voices, replace faces, and move bodies in whatever way the user wants. Giving users the tools to do this on their own is not just a question of morality and consent, but also of long term business feasibility.

What happens if a lawsuit blames you for abuse done by users? It’s easy enough to do simplistic blocks for nudity, but what else? And beyond the legal repercussions, can you live with the potential mob of people who distrust the technology and believe it’s enabling a loss of personal consent?

There’s no better example of the potential long-term consequences of Generative AI than what’s been happening in the cryptocurrency space. Companies operated on fuzzy moral and legal grounds for a long time. The legal system eventually caught up with the ecosystem, and the things which were assumed legal because of a lack of action weren’t. I expect we’ll see the same with AI products eventually.

The Models Don’t Actually Matter, Because Better Ones Will be Here Next Week

This is a wild insight, because six months ago I would have laughed at the possibility. But the models don’t really matter much in practice.

Building any model is a race to the bottom and a Red Queen’s Race. At this point a lot of incredibly intelligent people are building models with billions of dollars worth of compute and resources. These mega models will be made largely without input from the many GPU poors, and will largely be heavily censored, opaque black boxes for end users.

What remains for the rest of us is to use an ever improving ensemble of smaller models, and to build tools and interfaces on top of them.

And building these interfaces is really where the value lies. Open models will continue to improve, and the state of the art will continue to be pushed. The models themselves must be interchangeable, as a better model will come about sooner than you think.
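To make "interchangeable" concrete, here is a minimal sketch of how I think about it, not a description of any particular library. The names `TextModel`, `MODELS`, `caption_clip`, and the two stand-in models are all hypothetical. The only assumption is that a model can be treated as "prompt in, text out".

```python
from typing import Callable, Dict

# Hypothetical interface: any text model is just "prompt in, text out".
TextModel = Callable[[str], str]


def echo_model(prompt: str) -> str:
    """Stand-in for this week's model."""
    return f"echo: {prompt}"


def shout_model(prompt: str) -> str:
    """Stand-in for next week's better model."""
    return prompt.upper()


# The registry is the only place that knows which models exist.
MODELS: Dict[str, TextModel] = {
    "echo-v1": echo_model,
    "shout-v2": shout_model,
}


def caption_clip(transcript: str, model_name: str = "echo-v1") -> str:
    """Application code depends on a registry key, not on a specific model."""
    model = MODELS[model_name]
    return model(f"Write a caption for: {transcript}")


print(caption_clip("a dog surfing"))
print(caption_clip("a dog surfing", model_name="shout-v2"))
```

Swapping in a new model then means adding one entry to the registry, while the interface code around it stays the same.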

But these models as they exist are currently unpredictable and difficult to understand. Bringing understanding, or at least predictability to models shows potential as a moat. (But of course, if AGI comes, there will be no moats, anywhere, unless you have a source of energy, compute, and water larger than your competitors. Oh and your model is better at stealing their model’s weights and…)

The Immediate Future is Going to Be Weird

At this point, the internet as a source of training data is being filled with generated text and images, and the noise from this model generated output will only continue to grow.

We can assume the output will eventually become so good that humans won’t be able to tell generated content from human-made content.

What does this mean for us as humans, if the place where most of our social interactions currently occur is no longer primarily human content?

Depending on your viewpoint, we may already have the answer. The algorithms for social media platforms are already incredibly sticky and good at capturing our attention.

What if a combination of the algorithms and Generative AI builds the perfect Skinner box? Then the algorithms will compete to see who can give the best, most tailored emotional experience currently desired by the user. We then have a carefully orchestrated algorithm to subtly shape the behavior of humanity.

Open Source is in a Fundamentally Weaker Position

When I started out building software, Open Source gave me the tools that I couldn’t otherwise afford as a young person. A compiler was hundreds of dollars, but a Linux CD gave me access to all the tools I’d need to start building, right away.

Fast forward the decade or two I’ve been in software development, and Open Source powered the cloud. Trillions of dollars in economic value were generated off the back of Open Source contributions.

But for machine learning, there are two fundamental constraints which require access to capital to scale. Building a large model can cost millions of dollars at the low end, and grow from there for a state of the art model. Building a home scale computer for training smaller models can cost thousands, especially if you need multiple high end GPUs.

Additionally, the datasets to train on are large! In the past six months I’ve been working, I’ve hit bandwidth caps for my ISP multiple times, just using models. If I were to take in training data from the internet, things would look even worse.

Given the high costs of participation, the barrier to entry for Open Source is higher than it’s ever been. There are only so many deep learning labs capable of working with these large models. This means Open Source models will have fewer eyes and be smaller, until the hardware becomes cheaper, the data is more evenly distributed, or development is subsidized, either by venture capital or governments.

Incumbents Have Strong Advantages

One of the conventional narratives about startups is that they are small and nimble, and can build things faster than big companies.

But with Generative AI, this just isn’t true.

Building a better model currently requires access to capital and data. Large companies have both.

Building a good end user experience means having tooling around your models for easy exploration and better human understanding of the model’s behavior. Again, incumbent companies already have interfaces built up over years, which can be used to augment data before and after inference.

But here there is a real weakness in how the best of developers have been treated over the past two years.

With layoffs making the rounds, companies have parted ways with some of the best, most talented developers in their rosters. Without them to navigate the boundary between the existing product and new possibilities created by these models, large companies will lose, despite these inherent advantages.

We Don’t Know Where the Moat will Come From

If you look at AI right now, there appear to be relatively few moats.

NVIDIA, of course, seems to have the biggest one. They’ve built the GPUs, but more importantly, also the libraries and software to support researchers and builders. And they’ve been building the infrastructure for them for over a decade.

No other company was making as deep of an investment in the tooling to build accelerated computing, with as consistent of a vision.

Since then, of course, there is the story of OpenAI, who built a model that brought them a billion dollars in revenue in a year.

But how defensible is OpenAI’s moat? Open Source models are catching up, and at OpenAI’s last demo day they showcased products like LaundryBuddy, a far cry from the next step after GPT-4 to AGI.

The truth is, we don’t know where the moat will come from with Generative AI. In the meantime, the pickaxe and shovel companies will do well. Platforms like Modal and Replicate will make ML tooling approachable for developers, and we’ll soon see what the Uber of machine learning looks like.

Robotics Are Probably the Next Moat

Building and testing robots is expensive, as the real world is much more difficult than software to model. A basic robot for automation can start at $20k+, and the development iteration loop can be extremely slow, when you factor in having to test each software change in the real world, and hardware which can break unexpectedly.

To address this, NVIDIA has been building and pushing its next generation platform, Omniverse.

Omniverse is a platform to model and simulate environments. As an example, you could use digital twins to recreate and test your drone’s performance in a high resolution scan of Seattle.

Using ray-tracing and digital environments, you can model, test, and more importantly generate realistic training data for your robots virtually, allowing you to run tens of thousands of simulated tests in a digital environment.

Between this and the growth of model capabilities, a sharp team who can navigate the boundaries of physical, cloud, and models should be able to build an iPhone-like technical coordination moat. It remains to be seen whether this is a bipedal robot, or something else.

Nobody Can Keep Up with Progress

Even the most intelligent and deliberate of my peers can’t seem to keep up with the speed of advancements in the space. It seems every week we get a new breakthrough, one which may have applications, or contribute to a breakthrough in the current problem we’re solving.

Because of this, it’s easy to develop an underlying unease about our chosen problem spaces. Is it a dead end? Is there somewhere else that might have better results? Is there a completely different architecture I should be chasing?

Being in technology, there has always been an unease about the pace of learning the latest technology. But in the AI space, this feels faster than anything I’ve ever experienced. How you manage to stay focused, while not getting locked into dead ends, is a core part of navigating the space effectively.

There Are More Vibes than Hard Data at the Edges

How do you measure the performance of a Large Language Model?

More importantly, how do you measure it against another language model?

Right now there are tradeoffs across the available models, and there are tools to try them all out using the same prompt, to see the difference in results.
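The idea behind those side-by-side tools is simple enough to sketch. This is only an illustration under my own assumptions: the `compare` helper and the two toy "models" are hypothetical stand-ins for real model calls, which would normally go out to an API or a local runtime.

```python
from typing import Callable, Dict


def compare(prompt: str, models: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    """Run one prompt through every model and collect the outputs by name."""
    return {name: model(prompt) for name, model in models.items()}


# Toy stand-ins; a real harness would wrap actual model calls here.
candidates = {
    "reverse": lambda p: p[::-1],
    "title": lambda p: p.title(),
}

results = compare("hello world", candidates)
for name, output in results.items():
    print(f"{name}: {output}")
```

Note what the harness does not do: it lines the outputs up next to each other, but deciding which one is better is still left to whoever is reading them.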

But largely, opinions on the “correctness” of an output from the highest performing models are gut opinions. And over time, people have come to believe some models have changed for the worse, while black box model providers insist nothing has changed. (There are, of course, formal tests of a model’s capabilities, but most experts agree these are flawed.)

Because of the relative gap between a “correct” answer, and the one a person personally deems correct, there won’t really be an absolute measure of what the “correct” answer is. For instance, if a street level drug dealer asked your language model questions about strategies for growing their market presence, what response should be deemed “correct”?

Secrecy and Security Matter in Ways They Don’t Normally

What do you think the market value for the raw weights of GPT-4 is?

If someone leaked them as a torrent (like LLaMA), how long would it be before they were optimized to run on consumer hardware?

The current moat of language model companies revolves around the premise of their weights never being leaked. This means they have to trust their cloud providers, their employees, and the security of their systems to protect each layer of their infrastructure, and the entirety of their business.

I’m certain intelligence agencies from all over the world are interested in the applications of these advanced language models and their employees. I also don’t expect these companies to be defending their technology on their own.

Between this and hundred million dollar plus training runs, the upper echelons of machine learning are a bit scary! Throw into the mix people who are convinced the training runs are potentially catastrophic for humanity’s future, and there’s sure to be a bit of intrigue in these companies.

Finding Your Position in the Coming AI Landscape

I was recently interviewed on a podcast, and asked about the future of machine learning.

At the time, I felt uncertain about offering any sort of advice. After six months of working in the space, I didn’t really have any purely optimistic, encouraging advice for people entering the space. There are genuine traps here for builders, and incumbents really do have non-trivial edges in the space, in ways they didn’t for cloud or mobile.

Despite this, I still want to build, and encourage others to do the same. When software began taking over the world, it had the potential to alienate people who didn’t understand how it was built, and thus couldn’t model how it misbehaved.

But AI threatens to do the same to everyone else. Except for a few thousand engineers and researchers, the rest of humanity will be captive to the decisions made about what to prioritize, censor, and mark as the correct answer for these giant models. That’s too important of a collective decision to be left to so few.

Although Open Source and the GPU Poors may not have the same advantages, I believe we must try.

all the artwork in this post by magritte