What breaks when you run a client account on AI alone for 30 days

OpenAI spent six months on an experiment in which every line of code in one of its internal repos was generated by Codex. No human commits at all. They weren't trying to prove AI could write code. They were trying to find the exact points where the system fell apart once you took the human out of the middle. The failures taught them more than the wins did.

Now run that same experiment on a client account.

Pick one client. Decent tracking, clear KPIs, nothing on fire. For 30 days, hand the account to an AI agent with full access. Ad platforms, GA4, the CMS, and the client's email thread. Reports go out automatically. The agent makes the optimization calls. No strategist steering anything. No senior buyer reviewing the work. Just the agent and the account.

Most agency owners assume this ends in flames. It doesn't. It ends in something worse: a slow erosion you can't quite see until you're past the point where fixing it is cheap. And the places it goes wrong are the most interesting part, because they map exactly to where AI is and isn't useful in agency work today.

The first week is the trap

Week one looks great, and that's the problem.

The agent reads through the historical data, finds the obvious optimization wins, kills weak creative, scales what's working, sends a weekly report that's mostly accurate, and replies to client emails within ten minutes. Activity is up. If you graded the account on output alone, you'd think the experiment was a runaway hit.

It isn't, because week one is the work AI is actually good at. Pattern matching against history. Writing variations of things that already exist. Pulling numbers into a template. Anywhere the right answer lives inside the data the agent can see, the agent will find it. The trouble starts when the right answer lives somewhere else.

The context that isn't in the data

By the middle of week two, you start seeing decisions that look reasonable but aren't.

The agent scales an ad set that's hitting its CPA target. The CPA only looks good because the offer itself is unsustainable: the client has been quietly burning margin on every order to chase volume. A strategist would have caught that by day two with one question about unit economics. The agent doesn't ask. It optimizes against the metric in front of it.

A Tuesday conversion spike gets reported as a win. Two days later, someone notices the spike was bot traffic from a sketchy affiliate the client signed up with and forgot to mention. A human would have checked the source quality. The agent reported the headline number because that's what the template called for.

A new campaign keeps getting more budget even though the signal is weak. Three weeks ago on the kickoff call, the client said the audience for this product is nothing like their core audience, and the targeting playbook that works elsewhere won't work here. That call isn't transcribed anywhere. The agent never had access to it. So it keeps testing variants of a strategy that was wrong on day one.

That's the pattern. AI fails quietly when the right answer needs information that lives in someone's head, in a Slack thread the agent isn't in, in a call nobody recorded, or in business context nobody wrote down. A real share of senior agency work runs on exactly that kind of information.

The client relationship breaks first

The communication problem shows up faster than you'd expect, and it breaks in a way that's hard to describe but easy to feel.

The agent writes competent emails. They answer the question that was asked. What they don't do is read what's actually happening on the other side of the conversation. A client who emails on a Sunday night asking "how are things going" is not asking for a status update. They're anxious, usually about something unrelated to the campaign, and they want a human to engage with them. The agent will send a status update. It might even be a good one. The client will read it, feel slightly worse, and start Googling other agencies.

Then there's the pushback problem. A real piece of senior agency work is telling clients no. No, we shouldn't run that promo. No, that audience isn't worth testing. No, your CAC isn't up because of targeting, it's up because the offer got worse. The agent won't do any of that. It's trained to be helpful, and helpful reads as agreement. So when the client says "let's pause Google and move everything to TikTok this quarter," the agent will figure out how to do it. A senior strategist would have pushed back hard before anything moved.

Strategic calls are the worst of the three. A campaign stops working. The options are fixing targeting, swapping creative, rewriting the offer, changing the landing page, or killing the whole thing. An experienced strategist makes that call in under a minute on signals that don't appear in any dashboard. The agent runs more tests. It generates more variants. It iterates inside the wrong frame and never asks whether the whole approach should go in the bin. You see it as flat performance and rising spend, and by the time someone catches it in a monthly review, you've lost three weeks.

What OpenAI got right that most agencies haven't

Here's the part of the OpenAI experiment worth taking seriously. They didn't point Codex at a repo and walk away. They rebuilt the work around the agent first. They wrote down playbooks. They built automated guardrails. They documented what "good" looked like at a level of detail nobody had bothered with while humans were doing the work. The agents needed an environment built for them, and that's why it actually worked.

Most agencies have never done this for their human staff, let alone for AI. The senior strategist knows what "good" looks like because they've been doing it for a decade. All of that knowledge lives in their head. None of it is written down. None of it is checked in anywhere. Pull that strategist off the account and hand it to AI, and every bit of implicit knowledge that made the account work walks out the door with them.

You can build the environment. Document brand voice for real. Define what counts as a quality conversion for each client and the rules for filtering out noise. Build anomaly detection that flags bot traffic, budget weirdness, and odd conversion patterns before they get reported as wins. Write down the rules for when to push back on a client. Codify what to do when a campaign starts going sideways. Once that exists, AI becomes genuinely useful in a way it isn't today, because the agent has something real to anchor against.
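To make the anomaly-detection piece concrete, here's a minimal sketch of a pre-report gate in Python. Everything in it is an assumption for illustration: the data shape, the SUSPECT_SOURCES list, and the 3x spike threshold would all come from your own stack and your own rules for each client.

```python
from statistics import median

# Sources the team already distrusts, e.g. an affiliate the client
# signed up with and forgot to mention. Hypothetical name.
SUSPECT_SOURCES = {"affiliate-network-x"}

def flag_anomalies(daily_conversions, spike_ratio=3.0):
    """Return flags for anything that should go to a strategist
    instead of straight into the client report.

    daily_conversions maps a traffic source to its daily conversion
    counts, oldest first. Both the shape and the 3x threshold are
    illustrative assumptions.
    """
    flags = []
    for source, series in daily_conversions.items():
        if source in SUSPECT_SOURCES:
            flags.append(f"{source}: on the suspect list; exclude from headline numbers")
            continue
        if len(series) < 8:
            continue  # too little history to call anything a spike
        baseline = median(series[:-1]) or 1  # guard against a zero baseline
        ratio = series[-1] / baseline
        if ratio >= spike_ratio:
            flags.append(
                f"{source}: today's conversions ({series[-1]}) are "
                f"{ratio:.1f}x the trailing median ({baseline})"
            )
    return flags

if __name__ == "__main__":
    sample = {
        "google-cpc": [40, 42, 38, 41, 39, 44, 40, 43],
        "affiliate-network-x": [5, 6, 4, 5, 7, 5, 6, 90],  # the Tuesday bot spike
    }
    for flag in flag_anomalies(sample):
        print("REVIEW BEFORE REPORTING:", flag)
```

The point isn't the math, it's the placement. The check runs before anything reaches a client-facing report, so a Tuesday bot spike gets routed to a strategist instead of reported as a win.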

The catch is that you've just written down the work that makes your agency valuable. Most owners don't want to look too closely at this, because if it's all documented, the obvious next question is what's stopping the client from doing it themselves. Worth being honest with yourself about that one.

What the experiment actually tells you

You can't run a client account on 100% AI for 30 days, and probably not for several years yet. The work that breaks is the work that keeps clients. Strategic judgment. Real relationship management. The ability to push back. Reading context that lives outside the data.

Map out what didn't break and the list is substantial. Reporting. Creative variants. Day-to-day optimization inside guardrails. Routine client updates. Tracking audits. Performance pacing. Anomaly flagging. A real share of what agencies do every day is work an agent can handle with light supervision and a strategist checking in twice a week.

Stop framing this as AI replacing strategists. The closer truth is AI takes the bottom 60% of the work so strategists can spend their time on the part that actually keeps the account alive.

The agencies that pull ahead over the next couple of years are the ones that run this experiment for real on one account, accept what they learn, and rebuild their systems around it. The agencies still pretending AI is either magic or useless are the ones who'll quietly bleed margin to the agencies who did the work.

Try it on one account. Don't go to 100%. Take the three workflows you run every week that feel most like "things a human shouldn't be doing" and replace those first. Watch what breaks. Adjust. Have fun.
