Episode 1

March 6, 2025

In the first episode of Raising an Agent, Quinn and Thorsten kick things off by sharing a lot of wow-moments they experienced after getting the agent at the heart of Amp into a working state. They talk about how little is actually needed to create an agent and how magical the results are when you give a model the right tools, unlimited tokens, and feedback. That might be the biggest surprise: how many previous assumptions feel outdated when you see an agent explore a codebase on its own.


Transcript

Quinn: It's like kind of crazy how easy it is to do this and that just is a huge testament to the amazingness of these models.

Thorsten: Yeah... maybe talk about this, that I think there's three big pieces, right... Hello everybody, and welcome to a new podcast called Raising an Agent. This is a limited-run, special-edition podcast. I'm Thorsten, I'm a software engineer here at Sourcegraph, and with me is Quinn, the CEO of Sourcegraph. Hi, Quinn.

Quinn: Hey, Thorsten. It has been so fun hacking with you. Now we get to talk about it.

Thorsten: Yes, that's exactly what this is. To set the scene: for the last two to three weeks, Quinn and I have been hacking on a new AI-fueled code-editing tool, an agentic tool. It's still a very rough prototype, still not production-ready, but it is taking shape. And already, every day — I think without exception — one of us was saying to the other, "holy shit, have you seen this?" Or, "have you tried this? It did this. You should try this". So there's a lot of excitement. And I think part of the excitement comes from how good these models are, right? And the other part comes from putting the model in a harness, so to say, and seeing what it does in that environment. Like, if you make the model an agent and you give it tools, what happens then? This podcast — again, a limited run — is our diary of excitement, so to say. We want to record what excites us. We want to share what excites us, because it's fun to share things that you're excited about. And, you know, maybe somebody else also gets excited about this, sees the things that we're seeing, and maybe comes back to us with more exciting stuff. Anything to add here, Quinn? That was a lot of excitement talk from me, but I'm sure you're also excited about this, right?

Quinn: Oh, yeah. I just shared the recording feature that you made, and it's so awesome. And it feels like, with these new models and the advances, we're discovering new ways of using code AI that people haven't thought about. I mean, one thing a lot of people say is, oh, it's great for starting a new project. But what about if you have a big refactor in an existing project? You told it to build something, a new feature, and it actually turned out to be really good at refactoring. And I've never seen that in any other tool. It's just mind-blowing. At the same time, I want to do this podcast with you because we can be really open about when it doesn't work. And we're in this mode where we can just go fix it, or understand why it doesn't work and at least plot out what we've got to build in the future to make it work. Because I think you and I are both pretty practical, and we're using this all the time, and we're not letting a shitty user experience live for very long.

Thorsten: Yeah, I think... okay, let's start with the recording thing as a small anecdote, which is one of the things that kicked this off, right? The recording tool is essentially: in your editor, you can press record, or use a shortcut or whatever, and then we record the changes you make. And after you stop the recording, you can send the diff of what you did to the LLM and ask it to do the same thing that you've done. Which sits in between — say you have auto-edits or inline completion on one end of the spectrum, and on the other end you have a chat interface where you tell the LLM, do this or do that. So here, you know, both of us independently came up with the idea — or maybe you influenced me: hey, I'm doing some really hard refactor, it's hard to explain what to do, and inline completion also doesn't cut it. It would be nice if it could record me, or see me do this, and then mimic me. So I asked our tool, I said, I want to build this feature. And it built the feature. I only had to tell it, oh, you have a compiler error or something, and then it fixed itself and built the thing. And I couldn't believe it. So I started it up. It even gave me a plan of how to try it out. It said: start this up, then run this command to start the recording. And I was 60% sure it didn't work. So I ran the command, started recording, a popup shows up, says "recording now". I was surprised. Okay, I did something — I refactored, I think for my test project I added some debug lines. I stopped recording. Again, a popup shows up and says "recording stopped". Again, surprising. And then I used the command that it added — I think you call it apply changes or something — to tell the LLM to mimic the changes that I made. And I hit that button, and what it did was put a prompt editor in the chat, where it had all of the changes that I made as single-character changes. So it was, I think, 60, 70 lines of single-character changes.
And I was sure — I mean, it was amusing to see this, but I was also sure that the other LLM I was going to send this to wouldn't get it. So I hit return, expecting nothing to work, and it worked. It understood the changes that I made based on single-character edits. And I recorded a video, took some screenshots, and shared this. And then, I think a day later, I was like: we have to record a podcast. So, you know, that's why we're here. Because every day there are moments like this.
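
The idea Thorsten describes — record a before/after state, collapse it into a diff, and ask the model to mimic the change — can be sketched roughly like this. This is a hypothetical sketch, not Amp's actual implementation; the function name and prompt wording are ours.

```python
# Sketch: turn a recorded edit into a "mimic this" prompt for an LLM.
import difflib

def build_mimic_prompt(before: str, after: str, target: str) -> str:
    """Turn a recorded before/after snapshot into a prompt for the model."""
    diff = "\n".join(difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="before", tofile="after", lineterm=""))
    return (
        "I recorded myself making this change:\n\n"
        f"{diff}\n\n"
        "Apply the same kind of change to the following code:\n\n"
        f"{target}"
    )
```

In practice, as Thorsten notes, even a much rawer representation — single-character edits rather than a clean unified diff — was enough for the model to infer the intent.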

Quinn: Yeah. And it's the promise of whenever you got an annoying refactor, you got to do the first 10% and maybe the last 3%, but not the other percents. That's pretty cool.

Thorsten: That's right. Yeah. Yeah, that's good.

Quinn: Like, all of this is on this prototype that we're building, which is basically: what if you take the latest model that's really good at tool calling, that is really accurate, and you build an editor agent based on that? What would you actually need in that? You don't need a separate chat. You don't need a multi-file editing mode. You just have the agent. You don't need — maybe, we're not sure — checkpoints and rollbacks. And it turns out you want MCP to be native. And in that world, MCP tools can mutate state outside of what you can capture in a rollback-type system anyway. And also, you can just tell it to roll back if it makes a change. And so, in two weeks, we got to something where we were having these mind-blowing moments. Not both of us every single night, but I would say at least one of us every night, as you said. And it's also fascinating how easy it was to get to that point. And it's this really nice architecture where now we can go improve the tools for searching or for editing a file. We can plug in a lot of the tools we've already built at Sourcegraph — potentially, I mean, who knows, they've got to prove their worth. It's kind of crazy how easy it is to do this. And that is a huge testament to the amazingness of these models.

Thorsten: Maybe talk about this, that I think there's three big pieces, right? The tools, the model, and, say, the wiring together of it all — the integration, simplified. What has surprised you about this — the tools, the model, the integration? What's the important bit, and what isn't?

Quinn: I think it's those three pieces, and the wiring together is not actually that hard because we got there in, like, two weeks or so. And you were spinning up here at Sourcegraph. Welcome back. And I was not going full-time because I've got some other stuff to do, although I've been trying to minimize that in the last few weeks.

Thorsten: Yeah. But I think the point is, we were both really surprised — you know, we're using Claude 3.7 now; we used 3.5 two weeks ago — we were both really surprised that if you just give Claude some tools, there's not much else to it. It really knows how to use those tools. And I'm sure there's some stuff in the prompts and whatnot, but it's kind of mind-blowing to see what it does. As an example, we have some tools to edit files, some tools to view files, a tool to run terminal commands. And I asked it, I think two days ago, to edit some file. And the edit tool call failed, because I think it sent a string and not a JSON object or something like that. And it tried twice. You can see it live. It's like, "let me try to edit this file" — maybe that's its voice. And it failed. We sent back an error message, we say, "that's not a valid argument". Then it tries again, with the argument somehow changed. It still fails. And then it decides and writes, "let me try this another way". And what it does is it creates a new file, puts the content that it wants completely in the new file, and then runs a terminal command to move the new file over the old file. And that's in the model. You just give it the tools, and it comes up with this stuff. That was mind-blowing to me. That was crazy to see.
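
The pattern behind this anecdote is that tool failures are returned to the model as readable results rather than crashing the loop, so the model can retry or route around the broken tool (here, falling back to writing a new file and `mv`-ing it over the old one). A minimal, illustrative sketch — the tool names and error strings are not Amp's:

```python
# Sketch: dispatch tool calls, reporting failures back as strings
# the model can read and react to.
def run_tool(name: str, args) -> str:
    if name == "edit_file":
        if not isinstance(args, dict) or "path" not in args:
            # Don't raise — return the error so the model can retry,
            # or fall back to another strategy (new file + mv).
            return "error: edit_file expects a JSON object with a 'path'"
        return f"edited {args['path']}"
    if name == "run_terminal_command":
        return f"ran: {args['command']}"
    return f"error: unknown tool {name!r}"
```

With this shape, the "let me try this another way" moment is just the model reading the error result and choosing `run_terminal_command` instead.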

Quinn: Yeah, that's right. So what are some of the obvious things that we know we need to build to make this even better, to go way beyond state-of-the-art for the kind of editor-agent thing? What's in your mind there?

Thorsten: I think the biggest realization, the mind shift over the last few weeks, is this: pre-tool-calling, pre-agents, I had different expectations of code-editing models, and I think a lot of people still have these expectations. That is: you talk to a model, you ask it to generate some code, you write one or two lines of prompt, and it comes back with 300 lines of code. And then you're like, thumbs down, it didn't do what I wanted. And then you tweak the prompt and do it again. And my mindset now is: why would I have ever expected this to work? And it works sometimes. But why would I have expected it to work that well? Because this is not what works for a human. Like, you and I, maybe on a really good day, can one-shot a 300-line file, right? If it's easy. But what we often need is...

Quinn: I would be very skeptical of any human who tried to do it that way, you know?

Thorsten: Yeah, well, I've seen some things. But what we need is, you know, compiler feedback, feedback from the interpreter, tests, linters, squiggly lines — all of that stuff to guide us and to help us understand: is the code that I'm writing working? And it turns out, if you give a model the same kind of feedback, it gets better at writing code. It starts to fix the code that it writes. It gives you 300 lines of code, and then it's like, oh, there's a compiler error, let me fix this. Oh, actually, this library doesn't exist. It sounds so trite when you put it like this — oh, just give it diagnostics or compiler feedback — but it's such a game changer. And to your question: what's really important is, well, the tools, of course, the model, of course, all of this. But give the agent the right feedback at the right time for the thing that it's trying to do. That, I think, is the crucial bit. That's the one we have to get right.
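
The feedback loop Thorsten describes reduces to a generate-check-repair cycle. A minimal sketch, where `model` and `check` are stand-ins for the real LLM call and for running a compiler, linter, or test suite:

```python
# Sketch: feed diagnostics back to the model until the check passes
# or we run out of turns.
def agent_loop(model, check, prompt: str, max_turns: int = 5) -> str:
    code = model(prompt)
    for _ in range(max_turns):
        diagnostics = check(code)  # e.g. compiler errors; empty if clean
        if not diagnostics:
            return code
        code = model(f"Fix these errors:\n{diagnostics}\n\nCode:\n{code}")
    return code
```

The substance is all in `check`: the better the diagnostics at the moment the agent needs them, the better the repair step works.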

Quinn: Yeah, the feedback loop. Here, this is where we have to be really careful when we think we know better than the model. We do think it's going to be really important to constrain it — to force it to satisfy the feedback loop, and to check that feedback loop before it proceeds, because sometimes it will not do that even if you put it in the system prompt. But you've got to be careful about when you think you're smarter than the model, because so many of the mistakes we made in the past when building AI things were where we tried to constrain the model too much, and actually letting it go a little was better — both because it would work better and because it would be more future-proof.

Thorsten: Yeah. That's the other big thing, right? The inversion of control. We had this top-down view of things, where we give the model all of the things, all of the context. And that's still right to some extent.

Quinn: It's like a baby bird in the nest and it's tweeting and you know, you, the mama bird spits some food into its mouth.

Thorsten: Yes. And now the view is more: it's a big bird and can catch its own food. You just have to present it with the food somehow, right? Yeah. Great example, I think: our agent has multiple different search tools. One of those is a keyword-based ripgrep search, you know, with globbing and whatnot — what everybody knows when I say ripgrep or grep. And the other is a codebase search, a semantic search. And yesterday, I think, I asked the agent to add some animations to the UI. I told it to use a specific CSS class, and it started keyword-searching for that CSS class. So it knows: okay, this will give me a result; this is a specific thing in this codebase that I can search for. But then it realized — oh, I want to see, I don't know what it was, how these animations are defined or something? And this is in our Tailwind config, right? And then it switched to the semantic search, and it searched for "animation definition" or something like this, some semantic thing. And I mean, this is part of the prompting, right? We can get to this in a second. But it's also this: if you have different tools, you don't necessarily have to dictate which one to use at which point — you let the model decide for itself, which works surprisingly well.
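
One way to read this: expose both search flavors as separate tools whose descriptions spell out when each is appropriate, and let the model route its own queries. A sketch of what such tool definitions might look like — the names, wording, and schema shape are our assumptions, not Amp's:

```python
# Sketch: two search tools whose descriptions let the model
# pick keyword vs. semantic search on its own.
SEARCH_TOOLS = [
    {
        "name": "keyword_search",
        "description": "Exact-match search (ripgrep). Best when you know a "
                       "literal identifier, CSS class, or string to find. "
                       "Supports glob filters.",
        "input_schema": {
            "type": "object",
            "properties": {"pattern": {"type": "string"},
                           "glob": {"type": "string"}},
            "required": ["pattern"],
        },
    },
    {
        "name": "codebase_search",
        "description": "Semantic search. Best for conceptual queries like "
                       "'where are animations defined' when you don't know "
                       "the exact symbol.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
]
```

The routing logic lives entirely in the descriptions; no intent-detection layer decides on the model's behalf.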

Quinn: Yeah. And in the past, the way you might try to solve for that in a RAG-like system is you say, "oh, let's do intent detection on the user's query. They're looking for something in the front end, so we should look in these files". And if you don't get that right, then you're screwed, because there's no feedback loop. There's no way for the model to come back and get some more. But these agents are highly robust now. They are very slow and very expensive — not crazy expensive, but you and I, both testing it and using it, have probably incurred like a thousand dollars in less than a month. And I would happily pay that. I would happily pay that for every single dev at Sourcegraph, and a ton of our customers would pay that too. But it's expensive. And then when you think about it with a feedback loop: well, why don't we run a thousand agents in parallel and have some kind of fitness function to see which one wrote the code that passes all the tests and is best on this metric, and then take the best? So if it was cheaper and faster, we could take this thing that we've built and just scale it up and have even better results. Obviously it's not as simple as that. So there are still so many constraints.
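
The thousand-agents idea Quinn floats is, at its core, best-of-N sampling with a fitness function. A minimal sketch, with `generate` standing in for a full agent run and `fitness` for a real score such as the fraction of tests passing:

```python
# Sketch: run n independent candidates and keep the one that scores best.
def best_of_n(generate, fitness, n: int):
    """Generate n candidates and return the one with the highest fitness."""
    candidates = [generate(i) for i in range(n)]
    return max(candidates, key=fitness)
```

As Quinn says, it isn't actually this simple — cost, latency, and the quality of the fitness function are the real constraints — but the scaling shape is this.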

Thorsten: I showed somebody a little demo of this yesterday, and I think we burned, I don't know, 10, 15 bucks. We don't show the prices yet.

Quinn: Yeah, I mean, just your time and their time looking at it is worth more than that. And that's the thing.

Thorsten: Right, right. Yeah, but so we did this, and then, after we had to restart our attempt, it did it, and it worked. It wrote a bunch of code and actually fixed the problem that both of us — both senior engineers — said, "I don't know how to do this". I would have had to sit down, and I think it would have taken me at least two hours to first understand the codebase and see all of this. And so the model did it — you know, spent a bunch of money. But what I then said to this other engineer was: have you seen Cerebras? Have you seen how fast it could be? I'm, you know, kind of talking out of my ass here. I'm not saying...

Quinn: How did you know that it's not the right fix if you are not familiar with that code base?

Thorsten: Again, coming back to feedback: I basically said, hey, I run this and I run that. It was an authentication server, and I ran a command and it says, "yep, you're signed in", you know, with your email, mrnugget, blah, blah, blah. But then it says, "here's the user info", and it says, "unknown, unknown, unknown, unknown". So it was a pretty obvious thing that this is wrong. It authenticated me, but the user info was wrong. So I know, and I said to the model, "fix this. If I run this, do this". And then throughout the conversation it says, "now try it again, try it again". And — to stay on this track a little bit, we could have a separate section of this podcast called "They're Just Like Us" — so I asked it to fix this, and it wrote some code, a lot of code. And again, coming back to one-shotting: I ran it. From what I could tell, the code looked like it made sense. But I ran it, and it had a nil pointer exception. It crashed. So I gave it the backtrace. What would a human do? Exactly what the model did. The model said, "let me put some debug statements in". So the model went and looked at the code that it wrote and put debug statements in to find out what was nil and how the values flow through the system. And again, this is the mindset shift: previously, I think, the expectation was always, oh, this is a model, I'm just going to give it this and it's going to solve the problem for me at the press of a button. But the way I think about it now — and I cannot believe I'm saying this — is to truly think of this as, like, a word processor with intelligence, into which you put context, and then it thinks about it, and it's automated. It's not yet faster — not always faster — than you and I are, but it kind of does the same things that we do.
Like, it puts debug statements in. And then I ran it again, pasted the output to it, and it fixed the bug. And then it worked. And back to the point I was trying to make: Cerebras — they have like a thousand, two thousand tokens per second or something, with some specialized models. I'm not saying they can do this for Claude, but my bet is we will get there. If I have Claude 3.7 running at 2,000 tokens per second, that's a complete game changer. Yeah, yeah.

Quinn: Yeah. I mean, the fact that every single software developer and line of code is not written with the help of an agent that has a perfectly instrumented time-travel debugger with 100-millisecond full CI times — all of this stuff — it's a criminal situation. And it is only software engineering, not fundamental model research, that stands between where we are today and that world, which is really cool. Because, I mean, the generalization of adding log statements is a full time-travel debugger over all the code. And you should be able to stop it and say: what are the values of every variable here? And none of that is hard. I mean, it's hard in that it's a pain to set up, and no one actually does it — like using a real debugger. But it's doable. And with agents, there's actually more of an incentive to actually do it.

Thorsten: Yeah. I think that's part of it — we're both saying, you said this, I said it: we have to be patient. But at the same time, we're both not patient people. And there's this realization that, to some degree, you just have to type faster. There's not a lot of big riddles to be solved, right? Maybe I'll regret this, but my postulation is: if progress on the models stopped today and the model stayed the same — like you froze Claude 3.7 in time — we could build so much stuff over the next few years. We haven't even tapped, you know, 5% of what it could do. It's all in how much context you give it, at what point, how you guide it, how you prompt it, how you build the UIs and the UX around how to approach this model.

Quinn: Yeah. Totally. And there's so much suboptimal stuff going on. In most tools that people use when coding, the way the tools are described and presented to the model is completely suboptimal. It's not matching how that model expects the tools to be presented. With Claude 3.7 Sonnet, you provide the tools via an API and it does its own system prompt. And that's different from how GPT-4, you know, and o3 and all these expect it. But all these tools, they use one kind of generic system prompt, for all the models. And it works, but how much fidelity are you losing? Nobody really knows. And is anyone doing evals of tool choice based on the descriptions, or of tool-argument selection based on that? No. And when you get into the world of MCP, you have servers written by some rando, tools that maybe are not exposed at the right granularity. Like, why does the Slack MCP server require the model to first call list-channels, iterate over the results, get the ID, and then provide that to the other tool? I mean, these things are so early and so suboptimal, and what a wonder they work as well as they do. But people have got to realize: these are very early days, and it's cool to be here. And, you know, for us as a company, we've always tried to build things that 100% of devs at a company use. That's a very different kind of product from building something that, you know, the five percent most-online devs use. For us that means we have to make something that every last damn dev can figure out. It can't rely on that person being a really good prompter; it's got to be more reliable. And it's hard to do that, but I think it pushes us in a direction of greater robustness. The flip that we're making with this prototype, though, is that there are so many benefits to our customers in having a vehicle that brings them the latest advances of the models. And one thing we want to do here is be really opinionated, to say what we think is best for our customers.
And for this prototype, if they're using it, they are basically opting in to an experience built around the best model. And that means a few things are overturned. I think it means multi-model is overturned. There's not going to be a model selector, because we just cannot build a single product where you can flip from Claude 3.7 Sonnet to Gemini. Why would we support both of those? We have an opinion on which one is best, and any customer that truly wants the very best is going to be okay with that. So, you know, all these kinds of opinionated things. Another thing is we need to be able to rip out a feature. Let's say we had added checkpoints, and then we found out that people are using a bunch of tools where we cannot roll back after that tool makes a mutation to the outside world. We'd need to rip out checkpoints, and we can't have a situation where customers were depending on this behavior. We're already seeing that in a lot of the AI tools that were created six to nine months ago, where people are getting used to some of these things, and it's hard to change. So part of this podcast also is that we hope to explain why we're making the decisions that we do, because change is always hard, but we want to be in a position to have the perfect vehicle for the very best editor agent, whatever that means literally today. And we get early access to models. So, you know, we'll have two weeks or so to furiously code something out if we think something is a better model, so it'll be available to everyone on launch day.

Thorsten: And I think, to add to this — lest people misunderstand it — yes, we cannot build something just for the 5% most-online, anime-avatar people. But I do think what we're trying to do right now is really focus on how we would use it. And not imagine some anonymous, gray, stock-avatar person working at a large enterprise company who, oh, surely must ask for SAML support on the first day. No, we're trying to build something that we use and that works for us. And then we have to start thinking about how to build it out for the enterprise. But first, let's build something that we use, and focus on the stuff that bugs us. Because chances are, if it bugs us and we think, oh, I wouldn't use this, then right now others wouldn't use it either. And sure, you get to a point where there are features that you wouldn't use, but it's also a time-and-resources thing: what do you focus on from the start? So, you know, focus on what we want to use. Does it work for us? And then expand it from there, but really nail the core proposition of this: how well does it work?

Quinn: Yeah. Make it work really well — make it work really well for us first, and then for a few other people. And then, you know, don't add all the bells and whistles that make it infinitely configurable for all these different ways, so that someone could spend seven hours getting all the right tokens for all their MCP servers and it would work. But yeah, when I think of enterprise and making it work for everyone in a company, it's not the SAML features. Those are a necessary evil at some point. It's about making it work so damn well and being opinionated, so you don't have to be reading Twitter all the time for it to work really well. Yes. I think let's end it here with two big surprises, or things that delighted us.

Thorsten: Just to share some more of this — I have one on my list, and it has the F-word in front of it, but it's just to illustrate why we're excited about this. So yes, two days ago, I think, I asked the agent to update a function definition in Go, add a context.Context to the definition, and then update all of the call sites. So what it would do is update the definition, then get the compiler errors and see that all of the call sites are now wrong and need to pass a context in, and so on. And what it started to do was: it saw the compiler errors and started to fix the call sites one by one. It fixed one, it fixed two, and I think on the third one something failed because of indentation or something — which is still a common thing, we've got to figure this stuff out. But then it got tired, apparently, and said, "let me try this a different way". And it wrote a bash script to replace all of the call sites and put context in. And then it tried to run the bash script, which, again, didn't work. But just the fact that you give it this task, it tries twice, then gives up and writes a bash script — that's another X in the "they're just like us" column, I guess. So what was your most surprising or delightful interaction?

Quinn: Over the last few days, I've been working on the backend code here — not the UI code, but some refactors so that we support streaming of the LLM responses, including tool-call streaming, so it starts to invoke the tool before the whole response is done. And that was a slog. That was not fun. And I wasn't feeling like I was getting leverage from this prototype, unlike in the UI code and, you know, starting out. So — this was right before the recording feature you made — I found that if I just did like 10% of it, I could actually say: look at the git status and the git diff, git diff HEAD, and go finish the refactor. And it did a bunch of the work. And it was good, because I was kind of feeling guilty in my head, like, "oh man, I'm not actually able to use the prototype as much". But it did it. And I didn't have to add any new features there. And it didn't get me 100% of the way there, but it did enough to make me happy.

Thorsten: Yeah. The Git thing — I mean, that's a whole separate topic, I'll keep it short — but this is, again, one of those big surprises: if you give a model a Git commit, even in a super raw form, there's a lot of value in there. Especially when you want to add the context of: hey, previously we did this, so now I want to continue with this. Or: hey, revert this change, or understand this change, or something like that. Because in that commit you have, ideally, a good commit message, but you also have multiple files. You see the relation between files — that's a meta-signal. You see what changes together. You see which tests are updated, which types have to be imported, stuff like this. And the models — what I did, just like you, is I changed something in the UI, and then later on I wanted to change it again. I said, in this previous commit we made this change. And it then looks at the commit and understands: okay, if I change this component over here, I have to change it over here too, blah, blah, blah. And then it takes this into the context and goes off from there. Git commits are the new meta, as I said.
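
The technique Thorsten describes — handing a prior commit to the model as context for a follow-up request — might look roughly like this. The prompt wording is illustrative; `git show` returns the commit message plus the full diff, which carries exactly the meta-signals he mentions (related files, updated tests, imports).

```python
# Sketch: ground a follow-up request in a previous commit.
import subprocess

def commit_context(sha: str) -> str:
    """Fetch a commit (message + diff) to include in a prompt."""
    return subprocess.run(
        ["git", "show", sha],
        capture_output=True, text=True, check=True,
    ).stdout

def follow_up_prompt(commit_text: str, request: str) -> str:
    """Build a prompt that grounds a new request in a previous change."""
    return (
        "In a previous commit we made this change:\n\n"
        f"{commit_text}\n\n"
        f"Now: {request}"
    )
```

From there the model can infer which components change together and carry that forward, as in the UI example above.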

Quinn: Yeah. It'll be interesting to see what we said this time will change as we keep learning. Yeah. But I just love being able to hack with you and be able to use this product every single iteration. And love when I try it out and it's got a new feature for me, like the recording stuff. Super cool. And it's just so fun to be hacking on.

Thorsten: It's super exciting. Super exciting. All right. Let's end here. Diary of excitement, Raising an Agent. First episode. Thank you, Quinn, for doing this.

Quinn: Happy hacking.

Thorsten: Bye-bye.