Toil Budgets: How We Cut Operational Toil by 60% in One Quarter

2025-01-30 · SRE, Automation, Toil

In Q3 2024, our SRE team of six engineers was spending 68% of its time on operational toil. By the end of Q4, that number was 27%. This article describes exactly how we measured, budgeted, prioritized, and automated our way to that result. No magic, no tooling vendor pitch -- just disciplined application of toil reduction principles that any team can replicate.

What Counts as Toil (Google's Definition, Applied)

Google's SRE book defines toil as work that is manual, repetitive, automatable, tactical, devoid of enduring value, and scales linearly with service growth. This definition is precise enough to be useful but broad enough to be misapplied. We found it critical to distinguish toil from overhead and engineering work.

Toil: Manually rotating TLS certificates every 90 days. Running a script to scale up capacity before a known traffic event. Reviewing and approving boilerplate permission requests. Restarting a pod that OOM-kills every Tuesday because the memory limit has not been tuned.
Not toil (overhead): Sprint planning, on-call handoffs, postmortem review meetings. These are necessary operational activities but are not automatable and do not scale with service count.
Not toil (engineering): Writing a Terraform module, designing an alerting strategy, investigating a novel failure mode. These produce enduring value.

The most common mistake we see is classifying all operational work as toil. Responding to a novel incident is not toil. Responding to the same alert for the third time this week because the underlying issue was never fixed -- that is toil.

Measuring Toil: Time Tracking and Categorization

You cannot reduce what you do not measure. We implemented a lightweight time tracking system using a shared spreadsheet (later migrated to a simple internal tool) where each engineer logged toil activities daily. The logging requirement was intentionally minimal:

Date       | Engineer | Category              | Minutes | Description
2024-07-14 | A. Kim   | cert-rotation         | 45      | Rotated wildcard cert for prod
2024-07-14 | A. Kim   | capacity-scaling      | 30      | Pre-scaled worker pool for sale event
2024-07-14 | R. Patel | permission-requests   | 60      | Processed 8 IAM access requests
2024-07-15 | M. Jones | incident-remediation  | 90      | Restarted payment-svc (OOM, 3rd time)
2024-07-15 | R. Patel | deploy-support        | 45      | Helped team with failed canary rollback

After two weeks of logging, we had enough data to categorize toil into buckets and quantify the time spent in each. The results were sobering:

Permission and access requests: 22% of toil time
Certificate rotation and secret management: 15% of toil time
Pre-event capacity scaling: 14% of toil time
Deployment support and rollback assistance: 13% of toil time
Repeated incident remediation (known issues): 12% of toil time
Data exports and ad-hoc queries: 10% of toil time
Environment provisioning: 8% of toil time
Miscellaneous manual tasks: 6% of toil time

Setting Toil Budgets Per Team

Google recommends that SRE teams spend no more than 50% of their time on toil. We set a more aggressive target: 30%. The reasoning was straightforward -- at 68%, we had almost no capacity for reliability engineering work, which meant we were accumulating technical debt that would generate even more toil in future quarters.

We set the budget at the team level, not the individual level, because toil distribution is inherently uneven. On-call engineers absorb more interrupt-driven toil during their rotation, while off-rotation engineers can focus on automation projects. The team-level budget accommodates this natural variation.

The budget was expressed as a hard constraint in sprint planning: out of each two-week sprint's capacity (6 engineers x 10 days = 60 engineer-days), no more than 18 engineer-days (30%) should be spent on toil. The remaining 42 engineer-days were allocated to engineering work, with at least 50% of that (21 engineer-days) dedicated to toil-reduction automation projects.

Prioritizing Automation ROI

With limited engineering capacity, we could not automate everything at once. We ranked toil categories by a simple ROI formula:

ROI = (hours_saved_per_quarter) / (estimated_hours_to_automate)

Category                  | Hours/Quarter | Est. Automation | ROI
Permission requests       | 211           | 80              | 2.6x
Certificate rotation      | 144           | 40              | 3.6x
Capacity scaling          | 134           | 60              | 2.2x
Deploy support/rollbacks  | 125           | 120             | 1.0x
Repeated incident fixes   | 115           | 90              | 1.3x
Data exports              | 96            | 30              | 3.2x
Environment provisioning  | 77            | 100             | 0.8x

We prioritized by ROI, starting with certificate rotation (highest ROI at 3.6x), data exports (3.2x), and permission requests (2.6x). Deploy support and environment provisioning, despite being painful, had lower ROI due to the complexity of automation and were deferred to the following quarter.

What We Automated (and How)

Certificate Rotation

Previously, an engineer manually renewed wildcard certificates via the CA's web portal, downloaded the cert files, updated Kubernetes secrets in each cluster, and triggered rolling restarts of affected services. This happened every 90 days for 12 domains.

We deployed cert-manager with Let's Encrypt ACME integration and DNS-01 challenge validation via our cloud provider's DNS API. Certificates now auto-renew 30 days before expiry with no human involvement:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: sre@company.com
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      - dns01:
          cloudDNS:
            project: company-dns-prod
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-prod
  namespace: istio-system
spec:
  secretName: wildcard-prod-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - "*.company.com"
    - "*.api.company.com"
  renewBefore: 720h

Implementation time: 4 days. Time saved per quarter: 144 hours. This single automation item gave us back nearly one full engineer-month per quarter.

Permission and Access Requests

The permission request workflow was: engineer submits a Jira ticket, SRE reviews, SRE manually runs Terraform to grant IAM roles, SRE comments on the ticket. We processed roughly 35 requests per week.

We built a self-service system: engineers submit a PR to a permissions repository containing a YAML file specifying the requested role, scope, and justification. A CI pipeline validates the request against a policy engine (Open Policy Agent), requires approval from the resource owner (not SRE), and applies the Terraform change automatically on merge. Temporary access includes an expiry field that triggers automatic revocation via a scheduled pipeline.

# permissions/teams/payments/a-kim.yaml
requests:
  - role: roles/cloudsql.viewer
    resource: projects/prod-payments
    justification: "Debugging query performance for PAYMENTS-4521"
    expires: 2024-08-14
    approved_by: j-park  # team lead, not SRE

Implementation time: 12 days. Time saved per quarter: 211 hours.

Capacity Scaling

Before marketing campaigns and sale events, product teams would request SRE to pre-scale services. This involved manually adjusting HPA min replicas, node pool sizes, and database read replicas. We implemented event-driven autoscaling using KEDA with scheduled scalers and Prometheus-based scalers:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-gateway-scaler
spec:
  scaleTargetRef:
    name: api-gateway
  minReplicaCount: 5
  maxReplicaCount: 50
  triggers:
    - type: cron
      metadata:
        timezone: America/New_York
        start: "0 6 * * *"
        end: "0 22 * * *"
        desiredReplicas: "20"
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: http_requests_per_second
        threshold: "1000"
        query: sum(rate(http_requests_total{service="api-gateway"}[2m]))

Product teams can now add scheduled scaling events by submitting a PR with a KEDA ScaledObject configuration, without SRE involvement. Implementation time: 8 days. Time saved per quarter: 134 hours.

Data Exports

Support and analytics teams regularly requested ad-hoc data exports -- CSV dumps from production databases, filtered log extractions, metric reports. We built a self-service query interface using Apache Superset connected to read replicas, with pre-approved query templates and row-level security. Queries that match approved patterns execute immediately; novel queries require one-click approval from the data owner. Implementation time: 5 days. Time saved per quarter: 96 hours.

Cultural Resistance

Two forms of resistance emerged that we did not anticipate.

Toil as identity. Some engineers had built their professional identity around being the person who "keeps things running." Reducing toil felt like a threat to their role. We addressed this by reframing: the goal is not to eliminate operational expertise but to redirect it from repetitive tasks to engineering challenges. The engineer who spent 15 hours per month rotating certificates now designs and maintains the cert-manager infrastructure -- a more complex and rewarding responsibility.

Product team pushback on self-service. Product teams initially resisted the self-service permission system because submitting a PR to a YAML file felt more cumbersome than filing a Jira ticket and having SRE handle it. The key counterargument was turnaround time: the old process averaged 4 hours (waiting for SRE to pick up the ticket). The new process takes under 15 minutes from PR submission to access grant. Once teams experienced the speed improvement, resistance evaporated.

Results

After one quarter of focused toil reduction:

Toil percentage dropped from 68% to 27%, below our 30% target.
Engineering capacity increased by 41 percentage points, enabling us to complete three reliability projects that had been deferred for two quarters (service mesh migration, observability pipeline overhaul, disaster recovery testing).
Mean time for permission requests dropped from 4 hours to 14 minutes.
Certificate-related incidents dropped to zero (previously 2-3 per quarter due to manual rotation errors or missed renewals).
On-call burden decreased measurably: interrupt-driven tasks per on-call shift dropped from an average of 11 to 4.
Engineer satisfaction scores improved from 3.2 to 4.1 (out of 5) in the quarterly survey.

The 60% reduction in toil was not the result of heroic effort or expensive tooling. It was the result of measuring rigorously, prioritizing by ROI, and protecting automation capacity from being consumed by the very toil it was meant to eliminate. The discipline of the toil budget -- treating automation work as non-negotiable sprint allocation -- was the single most important factor in our success. The toil budget is now a permanent part of our quarterly planning, and each team reports their toil percentage in sprint retrospectives.