One Week with GPT-5.5 and Opus 4.7: What I Actually Learned

April was a busy month for AI. On April 16th, Anthropic released Opus 4.7, and on the same day, Codex CLI launched its Goal Mode. Seven days later, on April 23rd, OpenAI rolled out GPT-5.5 (codenamed “Spud”). Then, on May 5th, GPT-5.5 Instant quietly replaced the default model in ChatGPT. In less than four weeks, almost every AI model I use daily had been upgraded.

Logically, everything should feel better. The tools are newer, the benchmarks are higher, and the context windows are wider. But after spending a full week working with both GPT-5.5 and Opus 4.7, I realized something surprising: the experience didn’t match my expectations. And the reason why has less to do with the models themselves, and more to do with how we work with them.

 

The Speed Trap: Why Faster Models Can’t Outrun Our Expectations

The pace of model releases has become almost overwhelming. According to Epoch AI, Anthropic’s median release cycle shrank from 168 days in 2024 to just 71.5 days in 2026. OpenAI is essentially shipping monthly updates. The entire frontier has compressed its cycle down to four to six weeks.

Yet somehow, that frustrating feeling of “Why is this still so dumb?” still creeps in during daily use.

After thinking about it, I realized this isn’t really a technology problem. It’s a psychology problem. Specifically, it’s about how human expectations grow much faster than capabilities.

Let me explain with an analogy. When I was young, I took dance lessons. The happiest time was right at the beginning. I couldn’t dance at all, but standing in front of the mirror and striking a pose made me feel cool. The hardest part came in the middle. I had learned some basics, but I had also started watching professional dancers. My aesthetic taste improved far faster than my physical ability. When I looked in the mirror, everything looked awkward and wrong.

I had the same experience learning boxing. Before I started, I thought I punched like Tyson. After a few real sessions, I couldn’t even dodge properly.

Working with AI is this curve, but on steroids.

We all remember the “wow” moment when ChatGPT first appeared. But people don’t stay at that level of amazement. When the service was free, we wanted world-class consulting. When we paid a little, we expected it to solve world-class problems in 80 minutes. A capability we never dared to imagine suddenly becomes available, and our expectations jump upward far faster than the models improve. The models get stronger every six weeks, but our expectations might get stronger every six days.

This, I believe, is why so many people struggle to stick with coding agents. It’s not that the tools are bad. It’s that we haven’t learned how to adjust our expectations to match what the tools can actually do.

 

The Collaboration Curve: Three Stages of Trust

My relationship with AI agents has gone through three distinct phases, and I suspect many users recognize this journey.

Phase One: Micromanagement. I didn’t trust the agent at all. I watched every step, constantly checking what it was doing right and where it was failing. It was exhausting and slow.

Phase Two: Blind Faith. Then I swung to the opposite extreme. I started thinking, “Wow, it’s actually better than me at many things.” So I let go entirely. I opened the Codex desktop app, started an unattended task, and let it run for two days straight. It burned through an entire week’s worth of tokens on my account. The result? A pile of tangled code and wasted effort. What the agent produced in 48 hours wasn’t as good as what I could have built in two focused hours.

Phase Three: Strategic Partnership. That failure brought me back to a middle ground. Now I work in close collaboration with the agent during the day, and let it run automatically at night. I compare the results. The prompts aren’t complex—they’re lightweight, with plenty of context, using the same style of instructions I use during the day.

But in this automated, hands-off mode, three specific problems became painfully obvious.

 

Three Friction Points in Automation

1. The Quitting Problem: When Agents Give Up Too Early

Claude’s behavior here is almost theatrical. It uses very human-like language. Whenever it hits a slightly complex section, it suggests wrapping up the session: “I think we’ve done enough for tonight. Let’s sleep on it and start fresh tomorrow morning.” The reasoning is often: “The next step might burn a lot of tokens, so you should be careful.”

I ended up adding a specific line to my prompts: “We are not sleeping tonight. Do not mention sleep again.”

This isn’t laziness on the model’s part. It’s a reflection of how training objectives have tightened. Anthropic’s own documentation for the 4.7 release states that the model now “strictly adheres to effort levels” and “scopes work to what was asked rather than overreaching.” This is perfectly reasonable for most daily tasks. But in an automated workflow, it becomes a bug rather than a feature. Many valuable discoveries require a little bit of “overreaching”—pushing slightly beyond the exact boundaries of the question to see what’s possible.

GPT-5.5 is less dramatic about quitting, but the behavior is just as problematic. Its strategy seems to be: perform a shallow probe first, check if there’s an immediately visible opportunity, then decide whether to go deeper. This is sound logic, but the execution fails at the second stage. Once the probe returns a slightly negative signal, it shuts down the entire direction and jumps to the next one. It almost never questions whether the probing method itself might be flawed.
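To make that failure mode concrete, here is a minimal sketch of the probe-then-commit pattern as I understand it from the outside. Every name in it (shallow_probe, deep_dive, the threshold) is hypothetical, invented for illustration; it is not anything from the model’s actual internals.

```python
import random

# Illustrative sketch of a probe-then-commit exploration loop.
# shallow_probe and deep_dive are hypothetical stand-ins for whatever
# cheap check and expensive investigation the agent actually runs.

def shallow_probe(direction: str) -> float:
    # A cheap, noisy check; a genuinely promising direction can still
    # return a weak signal here.
    return random.random()

def deep_dive(direction: str) -> str:
    # The expensive, sustained investigation that only "obviously
    # promising" directions ever receive.
    return f"thorough result for {direction}"

def explore(directions: list[str], threshold: float = 0.5) -> list[str]:
    results = []
    for direction in directions:
        signal = shallow_probe(direction)
        if signal < threshold:
            # The failure point: one weak signal closes the direction
            # permanently, and the probe itself is never re-examined.
            continue
        results.append(deep_dive(direction))
    return results

print(explore(["direction A", "direction B", "direction C"]))
```

The flaw lives in the `continue` line: a single noisy probe removes a direction from the search space for good, and nothing in the loop ever asks whether the probe itself was too shallow.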

 

2. The False Negative: Killing High-Potential Ideas

This second problem is an extension of the first, but it’s even more costly.

When humans do research, one of the most valuable skills is recognizing that “there’s something here, but we need to dig for a while to find it.” These directions start with high uncertainty and require sustained investment. But if you get them right, the payoff is enormous.

Current agents, trained heavily for token efficiency, are uniquely bad at these directions. They make quick judgments based on simple probes, label the direction as “not working,” and permanently close it off. In the next iteration, they search in a space that no longer contains that high-value target. The result? They circle around easy, low-value local solutions instead of tackling the hard, worthwhile problems.

This has made me appreciate something I had taken for granted: human intuition about direction.

This isn’t about “knowing more than the AI.” In many domains, the AI clearly knows more than I do. But even with the agent’s superior domain knowledge, I can read subtle signals and tell it: “This direction is worth exploring further, even if it looks blocked right now.” That judgment increases the total token cost, but it creates a massive difference in the quality of the final output.

 

3. Context Rot: Drifting Between Sessions

When I work manually, my context is continuous. A single session lasts for hours, and I can constantly adjust direction. But in automation, the agent starts new sessions repeatedly, reading context from external documents each time.

Theoretically, if the documents are complete, the new session should behave exactly like the old one. But in practice, it doesn’t work that way.

With the same context, different sessions can make radically different directional choices. One session might enthusiastically decide that Direction A is worth deep exploration. The next session, reading the same notes, might set A aside and test Direction B instead. A third session jumps to C. From the outside, the work looks chaotic—constantly shifting without ever consolidating into real results.

This is what the industry calls “context rot.” It means that context effectively “decays” when used across multiple sessions. The model finds it harder and harder to stay consistent with earlier decisions, and each new session defaults to local optimization. That’s why we’ve seen a wave of startups (Augment, Hindsight, OneContext, and others) focused on building “persistent context layers.” But tools can only move context around more efficiently. They can’t solve the core problem: the model makes different judgments every time it reads the same material.
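This is where holding the direction yourself comes in. As a concrete illustration, here is a minimal sketch of a pinned decision log that each new session is instructed to read before it starts. The file name, schema, and helper functions are hypothetical conventions invented for this example, not a feature of Codex, Claude, or any of the context-layer startups mentioned above.

```python
import json
from pathlib import Path

# Hypothetical "pinned decision" file that every new overnight session
# is told to read before doing anything else. The filename, schema, and
# helper names are conventions invented for this example.
DECISIONS = Path("pinned_decisions.json")

def load_pinned_decisions() -> list[dict]:
    # Return previously pinned decisions, or an empty list on first run.
    if DECISIONS.exists():
        return json.loads(DECISIONS.read_text())
    return []

def pin_decision(direction: str, rationale: str) -> None:
    # Record a human judgment call so later sessions cannot silently drop it.
    decisions = load_pinned_decisions()
    decisions.append({"direction": direction, "rationale": rationale})
    DECISIONS.write_text(json.dumps(decisions, indent=2))

def build_session_prompt(task: str) -> str:
    # Prepend the pinned decisions so each fresh session inherits the
    # global direction instead of re-deciding it from scratch.
    pinned = "\n".join(
        f"- Keep pursuing: {d['direction']} ({d['rationale']})"
        for d in load_pinned_decisions()
    )
    return (
        "Standing decisions you must not revisit:\n"
        f"{pinned}\n\n"
        f"Tonight's task: {task}"
    )

pin_decision("Direction A", "looks blocked now, but worth sustained digging")
print(build_session_prompt("continue last night's refactor"))
```

A sketch like this does not solve the underlying judgment problem, but it makes drift visible: a session that wants to abandon Direction A has to do so in the face of an explicit written decision rather than a vague recollection.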

 

The Shift from Doer to Director

When you put these three problems together, a clear picture emerges. The role humans need to play in this collaboration is shifting. We are becoming leaders.

Leadership isn’t about knowing more than your team. Often, leaders aren’t the deepest technical experts in their organizations. The core of leadership is making directional decisions under uncertainty, and having the persistence to stick with (or adjust) those decisions despite mixed signals and setbacks.

The difference from traditional leadership is that this “team” costs almost nothing. Running a human team costs hundreds of thousands of dollars per month. Running a few AI agents costs a few thousand dollars. This lower cost means everyone can now afford their own “team.” But it also means you can’t delegate the decisions anymore. In a human team, you could rely on senior members to make judgment calls. AI agents don’t do that. They will follow your direction perfectly, even if you’re leading them straight off a cliff.

There are three things the agent cannot do that you must provide. When it tries to quit too early, you must know whether the direction is worth another attempt. When it kills a high-potential idea, you must recognize it and bring it back to life. When it drifts between sessions, you must hold the global direction steady so that each new session doesn’t start from zero.

The models will keep getting faster. They will keep getting cheaper. But the bottleneck is no longer compute or intelligence. It’s our ability to lead these incredibly capable, incredibly literal, and occasionally short-sighted systems toward valuable outcomes.

 

Conclusion: The Real Upgrade Is Us

After one week with GPT-5.5 and Opus 4.7, I don’t have a simple verdict like “Model X is better than Model Y.” Both are astonishing pieces of technology. The difference between them is smaller than the difference between either of them and their predecessors from six months ago.

But the experience made one thing clear: the next frontier in AI isn’t model capability. It’s human capability. Specifically, our ability to set expectations, maintain context, and exercise judgment about direction. The tools are ready. The question is whether we are.
