Consider the following two hypothetical Slack exchanges:
Do checkouts seem lower than usual today?
Right now users have a 40% chance of getting a 500 error on the checkout page, and it looks like it’s because of API limits with our new fraud detection provider. Can someone from the trust team hop on a call with me?
The difference in the organizational communication cost of these two questions is substantial.
In the first, the question is open-ended, imprecise, and low-cost for the person who wrote it.
If shared in a large Slack channel, dozens of people may become confused, start looking at unrelated systems, and ask scattershot follow-up questions just to get at the real question.
Because of the lack of precision and investment in the first question, a follow-up conversation is required, almost always involving people not directly connected to the problem. The organizational cost of an imprecise message is high, especially in a remote setting.
In person, the costs of informal and imprecise communication are lower. We can pick up non-verbal cues, ask immediate follow-up questions, and gauge how urgent a response should be. And we generally aren’t addressing 50+ people at a time!
In a remote environment, though, the cost of imprecise language is magnified. Every time we speak imprecisely we interrupt, distract, or raise communication costs across the entire organization. And we generally don’t have a way of quantifying this ongoing cost.
The Value of Precision
The Perfectionists tells the history of precision in the modern world, and how humans have bootstrapped machines for generating increasingly precise, reproducible objects.
The book opens with the “mother of all machines”, the lathe. When we lay a solid piece of metal horizontally and spin it, suddenly we can cut materials down to a precision within thousandths of an inch. With this tool humanity was freed from the slop of human imprecision, and could begin to build repeatable, precise objects.
The invention of this precision helped create the industrial revolution. The lathe’s accuracy was first applied to better barrels for cannons and guns, then to the improved steam engine, and eventually to interchangeable parts for firearms.
The idea of precision powered a revolution in the way we make and view everyday objects.
What struck me in reading the book is that we had to invent the very idea of precision before we could build precise objects. Over my career, I’ve seen a similar improvement in precision with software tooling.
Tools like Kubernetes now allow us to define and deploy software as a utility, with a level of reliability and consistency similar to what we expect from our electrical grid. Observability platforms allow us to inspect, discover, and communicate changes in post-human scale software systems as they occur.
But as natural language conversations become the interface to large language models, I think it’s especially important for us to take a step back, and reflect on the history of our own precision in the language we use when building that software.
We have precise tools for building and deploying software, but lack equivalent tools for communicating precisely about the software creation process. This imprecision becomes problematic when interacting with large language models like GPT-4.
Observability is a Medium for Increased Precision in Technical Conversations
When I began professional software development, industry practices were much less formal.
Web servers were usually updated via an FTP server. When you wanted to update a file, you’d rename the current file with a .old extension, and then drag and drop its replacement over. You’d check to see if the site was still working, and if it was, move on to the next issue. If it wasn’t, you’d delete the new file and rename the .old file back to roll back.
But sometimes after a roll out you’d get a ticket from the customer, and they’d say something about a portion of the site not working.
Usually you’d assume they used an old version of Internet Explorer, and if a cursory attempt to reproduce their issue didn’t work, you’d blame the issue on that. But generally, you’d have maybe a 30% chance of reproducing whatever issue they were sure was happening.
This led to a lot of time wasted trying to reproduce errors. As a developer you’d have to establish a probability of whether or not their complaint was valid, and if so, how long and hard you’d try to reproduce the error.
Over the past 10 years, the idea of observability has mostly eliminated this workflow. When a site goes down, we now have context. We’re able to see exactly what is failing, and where.
And everyone has equal access to that information. You can look at a page, get a link to it, and then share it with your team. Everyone can experience the same level of context in a very short time.
Having this level of precision in the conversation is cheaper. We don’t need to go digging for hours before we can find better hints to what may be going on. Observability platforms are decreasing the costs of miscommunication when it matters the most.
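The 40% figure in the opening Slack message is exactly the kind of number an observability platform hands you for free. As a toy illustration (the event shape and field names here are made up, not any particular vendor’s format):

```python
# Toy sketch: deriving a precise error rate from structured request events,
# the kind of aggregation an observability platform performs continuously.
events = [
    {"route": "/checkout", "status": 500},
    {"route": "/checkout", "status": 200},
    {"route": "/checkout", "status": 500},
    {"route": "/cart", "status": 200},
    {"route": "/checkout", "status": 200},
]

checkout = [e for e in events if e["route"] == "/checkout"]
error_rate = sum(e["status"] >= 500 for e in checkout) / len(checkout)
print(f"{error_rate:.0%} of checkout requests are failing")  # 50%
```

The point isn’t the five lines of Python; it’s that the platform turns “checkouts seem lower than usual” into a number everyone can link to and agree on.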
The difference between guessing at errors and seeing the exact problem firsthand has a lot of value for organizations. Companies, who’d rather not, are spending millions to have this sort of visibility, and the vendors providing it are worth tens of billions.
Precision in a technical context has value.
GPT-4 and the (im)Precision of the Prompt
Again, the imprecision of natural language becomes especially apparent once you spend a non-trivial amount of time attempting to get an LLM (like GPT-4) to accomplish a technical task.
The difference in prompt precision can be the difference between a response that works, and a response that’s useless. Indeed, prepending supporting “facts” to your prompt is one of the key ways we can augment a large language model’s ability to solve problems.
Consider the following two prompts:
Write me a .gitpod.yml for my python project
write a single .gitpod.yml file for a fastapi python project. the application is named main.py, and the command to run it is uvicorn main:app --host 0.0.0.0 --port 8000. the other library dependencies are fastapi uvicorn python-dotenv python-social-auth boto3. be sure to explain your work
We may be able to get a usable answer from the second prompt. With the first, we may get an entirely wrong response, and even if the response works, we’d still need a critical eye for subtle errors before using the generated code.
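For reference, a plausible answer to the second prompt looks something like this (a hand-written sketch, not actual model output):

```yaml
# .gitpod.yml — sketch for the FastAPI project described in the prompt
tasks:
  - init: pip install fastapi uvicorn python-dotenv python-social-auth boto3
    command: uvicorn main:app --host 0.0.0.0 --port 8000
ports:
  - port: 8000
    onOpen: open-preview
```

Every concrete detail here (the install list, the run command, the port) came straight from the prompt. The first prompt gives the model none of this, so it has to guess all of it.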
And as explained extremely well by Andrej Karpathy, the language model wants to imitate its large training corpus. If we want a response like an expert’s, we need to be explicit and ask for an expert response.
Are Imprecise Language Model Conversations Useless?
One of the early problems I saw in using large language models was their imprecision in responses, from what seemed to me fairly precise prompts.
They’d give answers that seemed right, but didn’t actually say anything.
For example, with a GPT-3.5 prompt to:
write me a python fastapi server that uses sso with ory idp to upload files to an s3 bucket on aws
GPT-3.5’s response includes a line of code that literally says:
# TODO: Implement ORY IDP token verification here to validate the access token
without any further clarification. To actually resolve that TODO, we’d need to follow up, all while assuming the code around it was actually right.
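For context, actually filling in that TODO would mean something like calling ORY Kratos’s session-check endpoint. The sketch below assumes a standard Kratos deployment; the URL and error handling are illustrative, not a verified integration:

```python
# Hypothetical sketch of the verification GPT-3.5 left as a TODO.
# Assumes a standard ORY Kratos deployment; the base URL is illustrative.
import json
import urllib.request

KRATOS_URL = "http://localhost:4433"  # illustrative default

def verify_token(session_token: str) -> dict:
    """Ask Kratos whether the session token is valid; return the session.

    Raises urllib.error.HTTPError (e.g. 401) when the token is invalid.
    """
    req = urllib.request.Request(
        f"{KRATOS_URL}/sessions/whoami",
        headers={"X-Session-Token": session_token},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Even this small sketch required knowing which endpoint to call and which header to send, which is precisely the context the model’s response omitted.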
If it wasn’t right, if it was only 90% right, the errors would multiply with every follow-up prompt. So our program built over 5 prompts has:
0.9 × 0.9 × 0.9 × 0.9 × 0.9 ≈ 59% chance a unit of code actually works
If we’re building anything non-trivial, an assistant that has a <60% chance of generating the right information doesn’t seem especially useful.
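That compounding is easy to check, under the same assumption as above: each chained response is independently 90% likely to be correct.

```python
# If each of n chained responses is independently 90% likely to be correct,
# the chance the final unit of code works is 0.9 ** n.
p, n = 0.9, 5
p_all_correct = p ** n
print(round(p_all_correct, 2))  # 0.59
```

And it gets worse fast: at ten prompts the same arithmetic gives roughly a one-in-three chance of working code.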
And indeed, if we look at GPT-4’s response to the same exact prompt:
write me a python fastapi server that uses sso with ory idp to upload files to an s3 bucket on aws
We get something that’s just a bit better, an awareness that the answer is incomplete:
# Verify the token with ORY Kratos.
# Add your logic here to verify the token using ORY Kratos API.
# If the token is invalid, raise an HTTPException.
# Mock implementation
if token != "your-valid-token":
    raise HTTPException(status_code=HTTP_401_UNAUTHORIZED, detail="Invalid token")
But really, it’s still broken. The admission of brokenness, though, is an improvement! That’s at least clearer communication.
Tools for More Precise Code and Communication
Right now, the way we organize the conversations around code and the code itself is very informal.
In one demo I’ve seen, LLMs, code, and conversations around features could all live in the same place:
We’d first begin with a specification of a feature for the large language model to keep track of. The conversations around this feature would then create a channel within the developer’s IDE.
From this channel, we can then invite others to work on the feature and discuss how it should work. The conversations about the code live in the same project space as the code itself, giving us a perfect history and full context for the entire feature:
Once we entertain the idea of formally linking our code and conversations, it becomes apparent that having context and history for each line of code can empower us to be more precise about what we know, and what we say.
Rather than relying on informal conversations with people with imperfect memories, we can use the complete context and code to reason about where and how to approach the problem.
This is an exciting approach.
Why don’t we have a unit of measurement for conversational (im)precision?
We’ve been talking about precision, but one thing we haven’t discussed yet is the cost of that precision in a group setting. Expecting people to have better communication skills may mean that more people self-censor, and there are fewer conversations in general, leading to less effective communication overall.
Not every conversation needs to be precise. There are places and times to be specific in an organization, like when discussing a new feature, a bug, or an idea. It’s in these moments we want to be as specific as possible, so we avoid talking past one another, or holding different ideas in our heads when we mean the same thing.
One of my favorite pages on Wikipedia is the list of unusual units of measurement. In it, we learn of units like centipawns (the strength of a chess position, in hundredths of a pawn), micromorts (a one-in-a-million chance of death), and crabs (the intensity of X-rays, relative to the amount emitted by the Crab Nebula).
Why don’t we have units to measure how formal or informal a conversation is? (The closest I could find was a paper attempting to measure precision using a dictionary of words considered imprecise. We could measure the frequency with which a speaker uses words that don’t mean anything, and give them a formal bullshitter score.)
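The dictionary-based approach from that paper is simple enough to sketch. This toy version assumes a hand-picked word list; a real measure would need a much better dictionary and some notion of context:

```python
# A toy "imprecision score": the fraction of words in a message drawn
# from a (hypothetical, hand-picked) dictionary of vague filler terms.
IMPRECISE = {"stuff", "things", "somehow", "kinda", "maybe", "basically"}

def imprecision_score(message: str) -> float:
    words = message.lower().split()
    if not words:
        return 0.0
    return sum(w.strip(".,!?") in IMPRECISE for w in words) / len(words)

print(imprecision_score("Maybe the checkout stuff is kinda broken somehow"))
```

Run on the vague Slack message from the top of this post, a score like this would flag it long before dozens of people got confused.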
GitHub feels like a very formal place to have a conversation about code. It doesn’t feel like the sort of place to bullshit and gather context. Slack feels like a less formal communication channel, and Zoom feels like the most personal of all.
Why don’t we have a word for the likelihood of misinterpretation based upon using the wrong terminology? (We could also have a unit for how many different meanings a single word has. Or the likelihood that someone will choose a word with more room for interpretation, or…)
And could it be that the process of bidirectional communication itself is a form of computation, and the words you use matter just as much as the internal process of arriving at them?
Conversation as a form of Mutual Computation
The act of writing has been described as a method of formal thinking.
When you take to the written word, you’re forced to make a series of choices, all of which add up to an artifact, a program which runs on the reader’s mind.
What’s most interesting about the written word though, is that the reader is invited to collaborate while reading. They have their own internal dialogue, and their own internal context about the things they read. When we write and read well, it feels like a conversation.
Reading is an act of co-creation, and an act of re-interpretation.
Our human language is fuzzy and imperfect. We must assume it got this way because it has value in being so fuzzy and imperfect.
But when we write code, we are writing it with the assumption that it will be written to accomplish a specific goal. When our code doesn’t accomplish that goal, we don’t deploy it. We rewrite it until it does.
We don’t tend to have the same standards when we communicate with one another. We usually don’t start out with a formal goal before we open our mouths to speak, because human relationships aren’t built to be as transactional as code coldly completing a task.
I expect that if natural language continues to be the interface to large language models, we’ll collectively need to become less sloppy in how we communicate with one another. I expect we’ll be more deliberate about the words we choose, and I believe new words will enter the collective vocabulary to describe our level of precision in communication.
At the very least, I expect our conversations will become even more interesting if we all take the time to ask ourselves, “how can I make this message more precise?”