
The Great Linkerd Mystery: A Three-Act Kubernetes Drama

Updated on July 25, 2025
How Grammarly’s Kubernetes migration led us down a rabbit hole of proxy denials, until our Production Engineering team discovered that the real villain was hiding in plain sight. As the team responsible for ensuring millions of users can access Grammarly without interruption, we knew that we needed to get to the bottom of this high-stakes infrastructure mystery and do it quickly. In this blog post, we’ll walk you through the exact timeline of our investigation and how we finally caught the villain.

Act I: January’s false dawn

The migration launch—January 6, 2025

When we completed the migration of our text processing platform (i.e., the core services that analyze and improve users’ writing, which we call the data plane) from a legacy container service to our new Kubernetes-based data plane on AWS, we expected the usual growing pains. What we didn’t expect was for one of our production clusters to erupt into a storm of mysterious proxy “denied” errors—just as peak hours started.

 

To fix this issue, we reached out to Buoyant, the company behind Linkerd, the open-source service mesh that we had deployed to secure and monitor communication between our Kubernetes services. Through our communication with Buoyant’s support team, we realized that the proxy started refusing connections after the main API launched a WebSocket storm. Yet the very same cluster looked healthy as soon as we drained traffic away or rebooted its nodes.

 

Those first scares planted a dangerous seed: Is there a bug in the service mesh?

What is the “Linkerd denial” error anyway?

Before we dive deeper into our investigation, let’s clarify what these “denied” errors actually represent—this distinction turned out to be crucial to understanding our mystery.

Authorization denies vs. protocol detection denies

When Kubernetes pods have Linkerd’s lightweight proxy injected as a sidecar container (making them “meshed pods”), Linkerd’s authorization policy allows you to control which types of traffic can reach them. For example, you can specify that communication with a particular service (or HTTP route on a service) can only come from certain other services. When these policies block unauthorized traffic, the request gets rejected with a PermissionDenied error.

 

But that’s not what we were seeing.

 

Our Linkerd denial errors were actually related to protocol detection failures. When a cluster is under extreme load, the application might not send the initial bytes quickly enough for Linkerd to detect the protocol. In that situation, Linkerd falls back to treating the connection as raw TCP, and all HTTP-specific features are disabled for that connection.
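To make that failure mode concrete, here is a minimal sketch of "protocol detection with a timeout." It's an illustration in Python, not Linkerd's actual (Rust) proxy code; only the 10-second default timeout reflects Linkerd's documented behavior.

```python
import asyncio

# Linkerd's documented default protocol detection timeout is 10 seconds.
DETECTION_TIMEOUT_SECONDS = 10.0

async def detect_protocol(reader: asyncio.StreamReader) -> str:
    """Classify an inbound connection as 'http' or 'opaque-tcp'."""
    try:
        # Wait for the client's first bytes (a real proxy peeks rather than
        # consuming them; we read here to keep the sketch short).
        first_bytes = await asyncio.wait_for(reader.read(32), DETECTION_TIMEOUT_SECONDS)
    except asyncio.TimeoutError:
        # Under heavy load the application may be too slow to send anything:
        # fall back to raw TCP, losing all HTTP-level features for this connection.
        return "opaque-tcp"

    # Crude check: an HTTP/1.x request starts with a method token like "GET ".
    if first_bytes.split(b" ")[0].isalpha():
        return "http"
    return "opaque-tcp"
```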

The TCP vs. HTTP authorization problem

Here’s where our confusion began: Linkerd’s authorization policy lets us control which types of traffic are allowed to meshed pods. By default, many setups are configured to allow HTTP traffic but not raw TCP traffic.

 

We found that when protocol detection failed under load, Linkerd would fall back to treating connections as raw TCP. But because our authorization policies only permitted HTTP traffic, these TCP-fallback connections were denied. The result looked like authorization errors but was actually a symptom of protocol detection timeouts.
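The interaction between the two mechanisms is easier to see as a toy decision function (an illustration, not Linkerd's code): once detection times out, the connection is no longer "HTTP" as far as the policy is concerned, so an HTTP-only rule rejects it.

```python
def authorize(detected_protocol: str, allowed_protocols: set[str]) -> str:
    """Toy model of the authorization step applied after protocol detection."""
    return "allow" if detected_protocol in allowed_protocols else "deny"

http_only_policy = {"http"}

# Ordinary traffic, detected in time, sails through:
print(authorize("http", http_only_policy))        # allow

# The same HTTP traffic after a detection timeout is labeled opaque TCP
# and gets rejected -- this is the "denied" error we kept chasing:
print(authorize("opaque-tcp", http_only_policy))  # deny
```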

 

Looking back, the messages themselves weren’t the mystery—protocol detection timeouts and 10-second connection delays are documented Linkerd behaviors. The real puzzle was why our cluster kept hitting this condition so consistently, and why Linkerd kept denying what should have been ordinary HTTP traffic.

A quiet January, a lucky escape

Without a clear solution to the mystery, we decided to reroute traffic to other clusters to buy ourselves more time, which was workable since January is a quiet month. During this lull, we began an effort to optimize and reduce infrastructure costs. In this effort, we completed a full node rotation on the affected cluster, which appeared to “fix” the denies for the rest of the month. At the time, every on-call note ended with the same refrain: “Keep an eye on the service mesh denying allowed traffic,” though in hindsight, this merely masked the real culprit.

Meanwhile, in the Grammarly control plane: “Everyone is a group now”

At the same time, Grammarly launched a new business model, which required a migration that transformed every paying user into a separate group record. The Grammarly control plane services that managed those records suddenly became key dependencies for the suggestion pipeline.

 

As a result, the extra load made these services brittle. Whenever they stalled, user traffic vanished, autoscalers dutifully scaled the data plane down, and we unknowingly set ourselves up for the whiplash that would follow.

Act II: February’s “deny-storm season”

By mid-February, the data plane felt like a haunted house: Every time we touched it, Linkerd “denies” howled through the logs, and users lost suggestions. Three outages in one frenetic week forced us to hit pause on cost cuts and launch a “data plane stabilization” program. Let’s take a look at each of these outages in more detail.

January 22—The first cold shower

Duration: ~2 hours

 

The first real outage struck during the European evening, the US morning peak. The Grammarly control plane hiccupped, traffic dropped, and the data plane collapsed to half its usual size.

 

On-call engineers stabilized things the only way they could: pinning minReplicas to peak-hour numbers by hand for the busiest services.

 

Users barely noticed, but the message was clear—aggressive autoscaling plus flaky control plane equals trouble.

February 11—The lock that took us offline

Duration: ~2.5 hours

 

Three weeks later, a bulk delete of 2,000 seat licenses locked the database. Requests backed up, and the main API could no longer establish new WebSockets. Autoscalers trimmed the text-checking fleet; when the DB recovered, there weren’t enough pods left to carry the surge, and Grammarly was impacted for 2.5 hours.

 

Slack erupted with “How do we scale everything up?” messages and frantic readiness probe tweaks. But something else caught our attention: During the scramble, we saw another wave of Linkerd denies—coinciding with CPU spikes on main API service pods.

Cost pressure and a nagging theory

All this unfolded against soaring cloud infrastructure bills. The Kubernetes migration added network charges, GPU-heavy ML models, and a fair bit of overprovisioning. A cost-cutting program kicked off in February, pushing for smaller pods and faster scale-down cycles.

 

It made perfect financial sense—and amplified every weakness we’d just discovered.

 

By mid-February, our working theory was as follows: CPU spikes → Linkerd denies → outage.
It felt consistent with Buoyant’s assessment and the charts we saw.

February 19—Redeploy roulette

Duration: 2 hours

 

A routine Helm chart change redeployed a dozen text-processing workloads right at the European evening peak. The burst of new pods stormed Linkerd’s request-validation path, triggering a two-hour incident in which error rates on text checking peaked at 60%.

 

We tried the usual dance:

 

  • Shift traffic away from sick clusters.
  • Manually scale CoreDNS vertically on the failing cluster because of DNS resolution errors in the logs; when that didn’t help, we kept blaming Linkerd’s interception of TCP connections for DNS queries.
  • Scale the biggest backends horizontally.
  • Trim the service-mesh pod budget to “give Linkerd some air.”
It worked—but only after we had thrown extra CPU at almost every layer, reinforcing the belief that Linkerd was simply choking under load.

February 24—The main backend’s self-inflicted pileup

Duration: 2 hours

 

Four days later, an innocent attempt to move the main text-checking backend pods to a separate node pool accidentally restarted 17 deployments at once, since 17 versions of the service were deployed in the clusters. Their heavy startup, plus mistuned readiness probes and pod disruption budgets, formed a perfect retry storm: text checking overloaded, and suggestions limped for two hours.
Again, we blamed Linkerd denies, and again, the real fixes were classic hygiene—probe tuning, selective traffic-shaping, and manual upscaling.

February 25—Terraform butterfly effect

Duration: ~2.5 hours

 

The next afternoon, a failed Terraform apply in the control plane deleted critical security-group rules, severing traffic.

 

The outage unfolded in two acts:

 

  1. Control plane blackout (~20 minutes): Re-adding the rules revived logins and billing.
  2. Data plane starvation (140 minutes): While traffic was low, autoscalers happily shrank text-checking services. As a safety measure, engineers decided to scale all services up to 80% of their allowed maxReplicas—which was too much. Not only did it trigger the “Linkerd denies” problem, but it also broke Karpenter: Trying to parse ~4,500 stale NodeClaims on each cluster, it crash-looped with out-of-memory failures, which prevented any new nodes from launching.
We watched denies spike again during the frantic scale-out, but traffic graphs told a clearer story: The real villain was capacity surge, not the mesh.

Daily 15:30 EET “mini storms”

Meanwhile, every weekday, the main API rolled out on schedule during the European rush hour. Each rollout briefly doubled downstream calls, coaxed Linkerd’s deny counter into the red, and gave us a fresh scare.

February 27—“Stop cutting, start fixing”

By the end of the week, we finally admitted the inevitable:

 

  • We could not optimize the infrastructure usage and fix the infrastructure bugs at the same time, so we decided to pause cost optimization on data plane clusters for three weeks.
  • We opened the data plane stabilization track to hunt the root cause, harden probes and pod disruption budgets (PDBs), audit scaling rules, and figure out the Linkerd issues.

Act III: March—Peeling back the layers until only DNS was left

March 3: Still dancing around the real issue

We rode out another service mesh “denial” wave that slowed every text-processing cluster for about 50 minutes. The post-incident review again pointed at Linkerd overload during a main API redeploy, which we mitigated by simply upscaling services—the same playbook we had used in February.

 

We took the CPU-starvation hypothesis seriously: Buoyant’s own assessment had shown main API and downstreams pegged at 100% during every connection storm. So, we isolated the API onto its own NodeGroup with generous headroom and paused our cost optimization program. As a result, the March 14 stabilization update proudly reported zero outages for a whole week.

 

We thought we were winning, but that stability was an illusion: A new experiment had off-loaded traffic to internal LLMs. That meant fewer cross-service interconnections during peak hours, so we simply weren’t reaching the traffic threshold at which things crumbled. But we didn’t understand this yet.

The plot twist that changed everything

When we investigated this further with Buoyant, their CTO suspected we were “treating the smoke, not the fire.” His intuition proved correct when we discovered that denials may also be reported when the first read on a socket returns 0 bytes—that is, when connections are closed before protocol detection can complete. This pointed to a different issue altogether.

 

This wasn’t about authorization policies at all. It was about network-level connection failures that prevented protocol detection from succeeding in the first place. The “denials” we were seeing were a symptom, not the cause.

March 17: The pattern repeats

An emergency rollback of the main API during peak EU traffic triggered the Linkerd denials problem again. During the rollback, traffic returned to the usual text-processing backends and bypassed LLMs, which had already been seeing decreased loads during experiments since the beginning of March. Denials spiked exactly while new pods registered; dashboards looked painfully familiar.

March 18: Five-hour outage, one accidental clue

Duration: 5 hours

 

The facade cracked the next day. We had added a scaleback-prevention mechanism that used the historical number of replicas to mitigate the rapid scale-downs caused by denies during high-traffic periods. Because of the release patterns from the previous week, however, this system was “expecting” the main API to be released. We hadn’t deployed the API that day, but the system didn’t know that: It prepared for the scaling behavior of last week’s phantom release anyway. The result was the largest storm of denials we had ever seen and a five-hour, company-wide outage.
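For context, the guard worked roughly like the sketch below. The names, the data source, and the one-week lookback are hypothetical simplifications for illustration, not our actual implementation.

```python
from datetime import datetime, timedelta

def replica_floor(service: str, now: datetime, replica_history: dict) -> int:
    """Hypothetical scaleback guard: never scale a service below the replica
    count it needed at the same hour one week earlier.

    `replica_history` maps (service, hour-truncated timestamp) -> replicas,
    e.g. backfilled from a metrics store (illustrative, not our real schema).
    """
    same_hour_last_week = (now - timedelta(days=7)).replace(
        minute=0, second=0, microsecond=0
    )
    # If last week's window contained a main API release, its surge is baked
    # into this number -- the guard then "expects" a release that never comes.
    return replica_history.get((service, same_hour_last_week), 1)
```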

 

The team performed numerous actions to stabilize: We did manual scale-ups of the busiest backends, pinned minReplicas, restarted the main API, sped up the main API rollout, opened fuse limits, disabled Linkerd on one cluster, and more. But ultimately, what helped was the natural traffic drop after peak hours.

 

The crucial hint: An AWS representative joined the outage call, confirmed there was nothing obviously wrong on their side, but mentioned several components we could look at. One of them was CoreDNS, and that turned out to be the key insight.

 

CoreDNS is a flexible, extensible DNS server that can serve as the Kubernetes cluster DNS. When you launch an Amazon EKS cluster with at least one node, two replicas of the CoreDNS image are deployed by default, regardless of the number of nodes deployed in your cluster. The CoreDNS pods provide name resolution for all pods in the cluster.

March 19: The correlation becomes clear

The next day, the team analyzed the CoreDNS graphs. Nothing critical or especially suspicious turned up, but just in case, we scaled one cluster up to 12 CoreDNS pods. In the evening, the familiar pattern started again—on every cluster except the one with 12 CoreDNS pods. We fanned replicas out to 12 on every cluster, and denials disappeared within minutes.
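For reference, scaling CoreDNS is just a patch to its Deployment’s replica count. Here is a minimal sketch using the official Kubernetes Python client, assuming CoreDNS runs as the default coredns Deployment in kube-system (as on EKS) and that kubeconfig access is available; it is an illustration, not the exact command we ran.

```python
from kubernetes import client, config

def scale_coredns(replicas: int) -> None:
    """Patch the replica count of the kube-system/coredns Deployment."""
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name="coredns",
        namespace="kube-system",
        body={"spec": {"replicas": replicas}},
    )

# What we did on March 19: one cluster first, then all of them.
scale_coredns(12)
```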

 

For the first time, the mesh looked innocent; our DNS layer suddenly looked very, very guilty.

The detective work: Uncovering ENA conntrack limits

 

Over the following week, the team:
  • Rolled out NodeLocal DNSCache in production to offload DNS resolution from the centralized CoreDNS to local caches
  • Prepared the loadtest setup in preprod to reproduce the symptom without users watching
  • Enabled the ethtool metrics in node_exporter
  • Started to redeploy the main API in preprod under load until denies started happening
The smoking gun: The counter node_ethtool_conntrack_allowance_exceeded jumped exactly when Linkerd denials were reported. We were not hitting Linux nf_conntrack limits at all. Instead, on the instances running CoreDNS, we were silently blowing through AWS’s per-Elastic-Network-Adapter (ENA) conntrack allowance, which mercilessly dropped packets without leaving a kernel trace. Each drop set off a cascading chain of failures: a failed DNS request, retries, client back-offs, connection closures, Linkerd protocol detection timeouts, and, eventually, the denial.
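If you want to check the same counter by hand on a node, a sketch like the one below reads the driver statistics that node_exporter’s ethtool collector scrapes. It assumes the ethtool binary is available on the host and that eth0 is the ENA-backed interface, which varies by instance type.

```python
import subprocess

def ena_conntrack_exceeded(interface: str = "eth0") -> int:
    """Return the ENA conntrack_allowance_exceeded counter for an interface."""
    # `ethtool -S <iface>` prints one "name: value" pair per driver statistic.
    output = subprocess.run(
        ["ethtool", "-S", interface], capture_output=True, text=True, check=True
    ).stdout
    for line in output.splitlines():
        name, _, value = line.strip().partition(":")
        if name.strip() == "conntrack_allowance_exceeded":
            return int(value.strip())
    raise RuntimeError(f"conntrack_allowance_exceeded not reported on {interface}")

print(ena_conntrack_exceeded())  # a steadily increasing value means silent drops
```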

March 28: Closing the loop

By March 28, we were able to declare success in our epic “data plane stabilization” effort:

 

  • We fixed CoreDNS at 4 replicas to increase the available ENA conntrack capacity.
  • We deployed NodeLocal DNSCache everywhere to distribute the load from central CoreDNS and cache DNS responses.
  • We added ENA-level conntrack metrics permanently to our dashboards and alerts to catch this issue in the future.

What we learned

Watch the whole stack—Don’t stop at application metrics

Our biggest blind spot was trusting surface-level metrics. We were monitoring CPU, memory, and kernel-level networking, but we completely missed the ENA-level conntrack_allowance_exceeded counter that signaled the silent packet drops. As a result, we blamed the service mesh for a network device limit that existed several layers below.

 

In practice: We now monitor ENA conntrack metrics alongside traditional application metrics and have set up alerts tied to these deeper infrastructure counters.

Scale in more than one dimension

CoreDNS had plenty of CPU and memory, but we were hitting per-pod UDP flow limits on the AWS ENA network adapter. Adding more replicas (horizontal scaling) distributed the connection load and solved the problem—something vertical scaling never could have achieved.

 

In practice: When troubleshooting performance issues, we now consider connection distribution, not just computational resources. We maintain a minimum of four CoreDNS replicas to keep per-pod UDP flow counts below ENA thresholds, and we have a NodeLocal DNS cache on each node.

Operational maturity: Infrastructure hygiene pays dividends

Throughout February and March, we systematically hardened our services: tuning readiness and liveness probes, configuring appropriate PDBs, rightsizing CPU and memory requests, and fixing autoscaling behavior. While none of these fixes individually solved our DNS problem, they eliminated noise from our dashboards and made the real signal visible.

 

In practice: We now maintain a “service hardening” checklist covering probes, PDBs, resource requests, and autoscaling configuration that new services must complete before production deployment.

“Palette architecture”: The power of identical clusters

Having six identical Kubernetes clusters serving different portions of traffic proved invaluable for both experimentation and risk mitigation. We could test different CPU settings, autoscaling targets, and even risky updates on one cluster while keeping the others stable.

 

In practice: This architecture became our controlled testing environment, allowing us to isolate variables like CPU limits, separate node groups, and different Linkerd sidecar configurations across clusters before rolling out changes fleet-wide.

Validate suspicions quickly with systematic testing

When we suspected CPU starvation, we immediately isolated the main API onto dedicated nodes and paused cost-cutting measures. While this wasn’t the root cause, it allowed us to focus our investigation elsewhere rather than chasing false leads for weeks.

 

In practice: We used a “hypothesis → test → verdict” approach for our experiments with data plane stabilizations, documenting what we ruled out as much as what we confirmed.

What we recommend

For AWS EKS users: Monitor ENA metrics

If you’re running workloads on AWS EKS, especially DNS-intensive services, set up monitoring for ENA network performance metrics. The conntrack_allowance_exceeded counter can help you detect connection tracking issues before they impact your applications.

 

In practice: Enable the ethtool collector in node_exporter with the --collector.ethtool command-line flag. Monitor queries like rate(node_ethtool_conntrack_allowance_exceeded[5m]) and alert whenever the rate rises above zero.
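As a concrete starting point, here is a minimal sketch that asks Prometheus for that rate over the last five minutes and prints any instance where it is nonzero. The Prometheus URL is a placeholder, and the metric name assumes node_exporter’s ethtool collector is enabled as described above.

```python
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # placeholder
QUERY = "rate(node_ethtool_conntrack_allowance_exceeded[5m]) > 0"

def instances_dropping_packets() -> list[str]:
    """Return instances whose ENA conntrack allowance was exceeded recently."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY})
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return [sample["metric"].get("instance", "<unknown>") for sample in results]

for instance in instances_dropping_packets():
    print(f"ENA conntrack allowance exceeded on {instance}")
```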

For Linkerd users: Update to 2.18+

Linkerd release 2.18 was heavily influenced by the story we are sharing with you. It includes a number of fixes, as well as clearer metrics and logs that help you understand what’s happening at the service-mesh level.

 

To share a few important ones: Buoyant found that Linkerd was putting a much heavier load than expected on CoreDNS, which was fixed in the Linkerd release 2.18 by PR #3807.

 

To reduce the “protocol detection denies” and the cases where a read of zero bytes was reported as a deny error, Linkerd 2.18 introduced support for the appProtocol field in Services, allowing protocol detection to be bypassed when the protocol is known ahead of time. It also introduced transport protocol headers for cross-proxy traffic, removing the need for inbound protocol detection entirely, since the peer proxy now shares the protocol. Finally, it exposes new metrics that clearly distinguish authorization policy violations from protocol detection issues, making it easier for operators to identify which type of “deny” they’re actually dealing with.

Final thoughts

Sometimes the villain isn’t the obvious suspect screaming in your logs. Sometimes it’s the quiet component you took for granted, silently dropping packets at the network level while a perfectly innocent service mesh takes the blame.

 

The stormy trilogy that started with Linkerd denies ended with a quiet line on a Grafana dashboard—but it rewired how we observe, test, and run on Kubernetes.
And that, finally, feels sustainable.

Acknowledgements and shout-outs

A complex, multiweek incident like this one takes an entire organization to resolve. Our thanks go to:

 

  • Platform/SRE team on-call engineers, incident commanders, and experiment leads for round-the-clock firefighting, root-cause sleuthing, and the “data plane stabilization” program
  • Core backend squads for rapid probe, PDB, and rollout-strategy fixes that bought us breathing room
  • And everyone across Engineering who cleared roadblocks, merged emergency MRs, or kept the night shift company
Your collective effort turned a puzzling outage stream into a stable, better-instrumented, cost-optimal, and scalable platform. Thank you.