Created on 2025-08-19 12:32
Published on 2025-08-29 10:30
If you work anywhere near reliability or operations, you’ve probably felt the cognitive whiplash. On one hand, threat intel feeds are lighting up with zero-days on edge devices, lightning-fast intrusions, and ransomware crews moving at “SRE-page” speed. On the other hand, many teams are genuinely better at detection, recovery, and not paying ransoms. So are we hurtling toward catastrophe, or finally building some muscle memory?
The answer, inconveniently, is “both.” And that tension is exactly where SRE and DevOps leaders need to operate—turning fear into runbooks, dashboards, and practiced drills that raise organizational reliability and security.
It’s difficult to argue the internet got calmer in 2024–2025. Verizon’s 2025 DBIR highlights a surge in vulnerability exploitation as an initial access vector, driven by zero-days on VPNs and other edge devices. Ransomware showed up in 44% of breaches (up from 32% the year prior), even as median payments declined, and third-party involvement in breaches doubled to 30%. Human factors remained involved in roughly 60% of breaches—meaning the way we design systems, processes, and help desks still matters as much as our firewalls.
Meanwhile, adversaries are getting faster. CrowdStrike’s 2025 Global Threat Report recorded the fastest eCrime breakout at 51 seconds and an average breakout time of 48 minutes, compressing defenders’ decision windows to minutes, not days. If you can’t detect and contain in that first hour, you’re effectively spotting the fire after the roof has caught. (CrowdStrike)
And while law enforcement punched back—the multinational disruption of LockBit in February 2024 was a milestone—the ransomware ecosystem proved, yet again, that it mutates quickly. LockBit’s infrastructure was seized, affiliates were arrested, and decryption keys released, but copycats and rival crews fill power vacuums fast. Takedowns buy time; they don’t end business models. (Europol, Department of Justice, The Guardian)
Finally, nation-state and geopolitical operations continue to blur motivations. The DBIR notes growth in espionage-motivated breaches, and European authorities observed sharp rises in disruptive, politically tinged attacks through 2024, a reminder that the reliability and safety of critical services become targets in times of tension. (AP News)
Now for the good news: measured across many organizations, some indicators are bending the right way.
IBM’s 2025 Cost of a Data Breach pegs the global average breach cost at $4.4M, a 9% decrease from 2024—driven by faster identification and containment and greater use of security automation. That’s not victory, but it is movement in the right direction. (IBM)
Mandiant’s M-Trends shows global median dwell time hovering around 10–11 days—not stellar, but lower than 2022—and in ransomware cases specifically, adversaries often “self-announce” within a few days, which is grim but detectable. Other incident-response datasets report even more aggressive reductions where managed detection and response (MDR) is engaged. The practical read: teams instrumented for early detection really can buy back time. (Google Services, SOPHOS)
Behavior is changing, too. Verizon reports 64% of victims refusing to pay ransoms, and the median payment has dropped to $115k. In the UK, one survey found just 17% of enterprises paid in 2025, with many leaning on immutable and air-gapped backups. Whether it’s policy, practice, or both, more teams are betting on recovery rather than capitulation. (IT Pro)
Finally, governance is catching up. NIST CSF 2.0 formalized a new Govern function, making security oversight and accountability first-class, while NIST SP 800-61r3 updated incident response to align with CSF 2.0’s lifecycle—bridging prevention, detection, response, and recovery as one continuous reliability problem. In the EU, NIS2 and DORA push resilience and operational risk management from “nice-to-have” into “must-have” for many sectors. (NIST Publications, Digital Strategy, EIOPA)
Ask any on-call engineer: the pager doesn’t care whether an incident starts with a kernel panic or a credential-stuffing attack. The same practices—clear SLOs, good telemetry, quick isolation, safe rollback, rehearsed recovery—decide whether customers feel pain.
A story I’ve seen variations of: an SRE team notices an unusual spike in outbound traffic from a service that “never” talks to the internet. Their alert fires not on malware signatures but on an SLO-derived anomaly tied to data egress. Within minutes they flip a feature flag to quarantine a subset, rotate credentials, and drain a compromised node pool. A later postmortem reveals an infostealer-harvested set of test credentials used by a third-party contractor. The fix wasn’t heroic; it was boring: least-privilege service accounts, passkeys for admins, and a KEV-driven patching sprint for an over-exposed VPN appliance. The lesson: reliability tooling, when pointed at security outcomes, shortens mean time to detect and contain.
That human piece—how a help desk authenticates a caller, how a runbook nudges choices under pressure—still drives many outcomes. The MGM/Caesars incidents from 2023 are famous for a reason: a single well-timed social-engineering call can cascade into days of downtime. SRE leadership can’t “fix humans,” but we can design systems that make the safe path easy and the unsafe path impossible, especially for password resets and privileged actions. (TechTarget, Specops Software)
So is cyber risk inflated or understated? The DBIR’s expansion of vulnerability exploitation and third-party risk says threats are getting sharper. At the same time, IBM’s 2025 cost data and declining ransom payments suggest resilience tactics are paying off. The synthesis isn’t “split the difference.” It’s to accept that adversary tempo increased, while defender muscle improved—and then to make deliberate trade-offs with SRE-grade discipline.
Where I see orgs struggle is misallocation, not apathy. Tool sprawl and overlapping agents create alert fatigue and brittle handoffs. That’s not resilience; it’s toil. Consolidation around telemetry, identity, and runtime containment platforms—paired with SRE practices—delivers more than another blinking dashboard. (InformationWeek, Petri IT Knowledgebase)
Most real breaches still feature credentials somewhere. Make the easy path the safe path: roll out passkeys/FIDO2 for admins and high-risk roles first, then for everyone. Bake identity checks into help-desk flows so “reset my MFA” requires hardware-bound proof, not sympathy. CISA’s guidance showcases how large agencies made the shift; it’s very doable with modern IdPs. Keep an error budget for unsafe authentication events, not just for uptime, and use deliberate rollout gates to drive those events down. (CISA)
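One way to make an error budget for unsafe authentication events concrete is sketched below. Everything here is an assumption for illustration: the `AuthWindow` schema, the 0.999 target, the 25% floor, and the gate policy itself (new exceptions to passkey enforcement are allowed only while budget remains). Real counts would come from your IdP's logs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthWindow:
    """Authentication counts for one reporting window (hypothetical schema)."""
    total_events: int
    unsafe_events: int  # e.g. password-only admin logins, MFA resets without hardware proof


def budget_remaining(window: AuthWindow, slo_target: float) -> float:
    """Fraction of the unsafe-auth error budget still unspent.

    slo_target is the desired share of safe events (0.999 below is an assumed value).
    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if window.total_events == 0:
        return 1.0
    allowed_unsafe = (1.0 - slo_target) * window.total_events
    return 1.0 - window.unsafe_events / allowed_unsafe


def gate_open(window: AuthWindow, slo_target: float = 0.999,
              min_remaining: float = 0.25) -> bool:
    """Assumed policy: grant new exceptions to passkey enforcement only while
    at least min_remaining of the budget is left."""
    return budget_remaining(window, slo_target) >= min_remaining


if __name__ == "__main__":
    week = AuthWindow(total_events=10_000, unsafe_events=5)
    print(f"budget remaining: {budget_remaining(week, 0.999):.2f}")
    print("gate open:", gate_open(week))
```

The point of the gate is the same as with availability budgets: when the budget is spent, the conversation shifts from "can we allow one more fallback?" to "what do we fix first?".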
Not all CVEs are equal. Use CISA’s KEV catalog to drive a weekly “kill list,” and track a security SLO such as “% of KEV issues remediated in X days,” with tighter targets for perimeter and VPN devices. Verizon’s 2025 DBIR called out edge device zero-days and slow remediation; that’s now a reliability risk. Pair this with immutable infra practices so rolling out patched images is a standard deploy, not a one-off fire drill. (CISA)
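The "% of KEV issues remediated in X days" SLO and the weekly kill list could be computed roughly as follows. This is a minimal sketch: the `KevFinding` record, the asset classes, the 7-day and 21-day targets, and the placeholder CVE IDs are all hypothetical, and the rule that an open finding only counts once its deadline has passed is one possible design choice, not a standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class KevFinding:
    """One KEV-listed vulnerability found on one asset (hypothetical schema)."""
    cve_id: str
    asset_class: str  # "edge" (perimeter/VPN) or "internal"
    detected: date
    remediated: Optional[date] = None


# Assumed remediation targets in days; tighter for edge devices.
TARGET_DAYS = {"edge": 7, "internal": 21}


def kev_slo(findings: list[KevFinding], today: date) -> float:
    """Share of due findings that were remediated within their target.

    A finding is "due" once it is remediated or its deadline has passed.
    """
    due = [
        f for f in findings
        if f.remediated is not None
        or (today - f.detected).days > TARGET_DAYS[f.asset_class]
    ]
    if not due:
        return 1.0
    met = sum(
        1 for f in due
        if f.remediated is not None
        and (f.remediated - f.detected).days <= TARGET_DAYS[f.asset_class]
    )
    return met / len(due)


def kill_list(findings: list[KevFinding], today: date) -> list[KevFinding]:
    """Open findings, most overdue first."""
    open_findings = [f for f in findings if f.remediated is None]
    return sorted(
        open_findings,
        key=lambda f: (today - f.detected).days - TARGET_DAYS[f.asset_class],
        reverse=True,
    )


if __name__ == "__main__":
    today = date(2025, 1, 30)
    findings = [
        KevFinding("CVE-0000-0001", "edge", date(2025, 1, 1), date(2025, 1, 5)),
        KevFinding("CVE-0000-0002", "edge", date(2025, 1, 1), date(2025, 1, 20)),
        KevFinding("CVE-0000-0003", "internal", date(2025, 1, 1)),
        KevFinding("CVE-0000-0004", "edge", date(2025, 1, 28)),
    ]
    print(f"KEV SLO compliance: {kev_slo(findings, today):.0%}")
    print("kill list:", [f.cve_id for f in kill_list(findings, today)])
```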
If you already do chaos engineering for reliability, extend it to adversarial scenarios: expired tokens, stolen service keys, sudden egress spikes, or a fake “ransom note” to test comms and escalation. Use blameless postmortems to land durable changes. Security chaos engineering has matured; treat it like a first-class learning tool rather than an extracurricular. (Google SRE, O'Reilly Media)
Breakout time can be under an hour; your instrumentation should beat that. Promote TTD/TTC to SLOs alongside availability, and alert on burn rates. A one-minute anomaly in outbound traffic or a denied egress policy can be the difference between a contained incident and a headline. Use your existing observability stack to aggregate identity events, network egress, and workload behavior; you don’t need a separate universe of dashboards to do this. (CrowdStrike)
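Treating time-to-detect as an SLO with burn-rate alerting might look like the sketch below. It borrows the two-window idea from the burn-rate alerting approach in the Google SRE Workbook; the detection limit, target, and paging factor are assumptions you would tune, and in practice the duration samples would likely come from detection drills or synthetic tests rather than real incidents alone. The same functions apply unchanged to TTC samples.

```python
from typing import Sequence


def burn_rate(durations_s: Sequence[float], limit_s: float, slo_target: float) -> float:
    """How fast a window spends its budget; 1.0 means exactly on budget.

    durations_s: time-to-detect samples in seconds.
    limit_s: the detection deadline (an assumed value, e.g. 600 seconds).
    slo_target: desired share of samples within the deadline, e.g. 0.95.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if not durations_s:
        return 0.0
    missed = sum(1 for d in durations_s if d > limit_s)
    return (missed / len(durations_s)) / (1.0 - slo_target)


def should_page(long_window: Sequence[float], short_window: Sequence[float],
                limit_s: float, slo_target: float, factor: float) -> bool:
    """Page only when both windows burn faster than `factor`.

    The long window shows the problem is significant; the short window
    shows it is still happening.
    """
    return (
        burn_rate(long_window, limit_s, slo_target) >= factor
        and burn_rate(short_window, limit_s, slo_target) >= factor
    )


if __name__ == "__main__":
    last_30_days = [60, 120, 900, 30, 45, 1500, 200, 90]
    last_3_days = [1500, 200, 90]
    print("long-window burn rate:", burn_rate(last_30_days, 600, 0.95))
    print("page:", should_page(last_30_days, last_3_days, 600, 0.95, factor=2.0))
```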
The best ransom negotiation is a good restore. Immutable snapshots, offline copies, and tested recovery RTO/RPO targets transform fear into a runbook. The fact that more organizations are refusing to pay isn’t bravado—it’s practice. Schedule quarterly restore game days; measure not just “did it work?” but “how many humans and how many hours?” Keep those numbers trending down. (IT Pro)
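To keep restore game days honest, it helps to record both wall-clock time and human-hours per drill and check the trend. The sketch below assumes a hypothetical `RestoreDrill` record and an RTO expressed in hours; the sample numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RestoreDrill:
    """Outcome of one restore game day (hypothetical schema)."""
    label: str
    succeeded: bool
    wall_clock_hours: float
    participant_hours: tuple[float, ...]  # hours spent by each person involved

    @property
    def human_hours(self) -> float:
        return sum(self.participant_hours)


def meets_rto(drill: RestoreDrill, rto_hours: float) -> bool:
    """A drill meets the RTO only if the restore worked and finished in time."""
    return drill.succeeded and drill.wall_clock_hours <= rto_hours


def trending_down(drills: Sequence[RestoreDrill]) -> bool:
    """True when each successful drill (oldest first) used fewer human-hours
    than the one before it."""
    hours = [d.human_hours for d in drills if d.succeeded]
    return len(hours) >= 2 and all(later < earlier
                                   for earlier, later in zip(hours, hours[1:]))


if __name__ == "__main__":
    drills = [
        RestoreDrill("Q1", True, 9.0, (9.0, 9.0, 6.0)),
        RestoreDrill("Q2", True, 6.0, (6.0, 5.0)),
        RestoreDrill("Q3", True, 3.5, (3.5, 2.0)),
    ]
    for d in drills:
        print(d.label, "human-hours:", d.human_hours, "meets 4h RTO:", meets_rto(d, 4.0))
    print("trending down:", trending_down(drills))
```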
The DBIR shows third-party involvement doubling. Treat vendors like services with SLOs: clear auth patterns (no shared creds), explicit egress rules, and auditable logs. Ask how their incident response aligns with NIST SP 800-61r3 and NIST CSF 2.0. This isn’t paperwork; it’s about shared drillability. (NIST Computer Security Resource Center)
In Europe, NIS2 raises the floor for essential and important entities, while DORA pushes financial firms to demonstrate digital operational resilience—precisely the space SRE knows how to measure. Map your reliability artifacts—SLOs, error budgets, postmortems, incident command playbooks—to NIS2/DORA obligations. That reuses muscle you already have, rather than spinning a parallel “compliance” universe. (Digital Strategy, EIOPA)
There’s a legitimate critique that fear drives overspending and tool accumulation without outcome improvements. I’ve seen teams carry dozens of overlapping agents, each creating alerts and none reducing time-to-contain. Consolidation and platformization can cut costs and shrink dwell time—if you make TTD/TTC a top-level metric and stop rewarding “number of alerts processed.” (SiliconANGLE, IBM)
On the flip side, underestimating risk is expensive in public. The MGM and Caesars incidents showed how one well-crafted phone call can turn into nine-figure losses and months of regulatory scrutiny. The middle ground is not moderation; it’s measurement. (Reuters)
Picture an SRE on point during an otherwise quiet Tuesday. A new alert triggers: “Unusual egress from svc-invoice-api to unfamiliar ASN.” They follow the playbook: isolate the deployment slice via service mesh policy, rotate JWT signing keys, trigger a credentials sweep, and open a security incident with the same IM ritual you’d use for a Sev-1 outage. A counter in the channel shows TTD at six minutes, TTC at 23. Later, analysis reveals initial access via an infostealer-harvested credential from a BYOD laptop months earlier. The remediation backlog now includes passkeys for finance admins, disabling local passwords for help-desk resets, and a KEV-driven update to the VPN headend. No heroics; just practiced moves that close the window adversaries depend on.
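The two mechanical pieces of that scenario, flagging egress to an unfamiliar ASN and keeping the TTD/TTC counter, can be sketched in a few lines. The `Flow` and `Timeline` records and the per-service allowlist are hypothetical, the ASNs come from the range reserved for documentation, and measuring both TTD and TTC from the first observed malicious activity is an assumption (some teams measure TTC from detection instead).

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable


@dataclass(frozen=True)
class Flow:
    """One outbound flow summary (hypothetical schema)."""
    service: str
    dest_asn: int
    bytes_out: int


def unfamiliar_egress(flows: Iterable[Flow],
                      known_asns: dict[str, set[int]]) -> list[Flow]:
    """Flows whose destination ASN is not on the service's allowlist.

    A service with no allowlist entry is treated as "never talks out".
    """
    return [f for f in flows if f.dest_asn not in known_asns.get(f.service, set())]


@dataclass(frozen=True)
class Timeline:
    """Key timestamps for one incident (hypothetical schema)."""
    first_malicious_activity: datetime
    detected: datetime
    contained: datetime

    @property
    def ttd_minutes(self) -> float:
        return (self.detected - self.first_malicious_activity).total_seconds() / 60.0

    @property
    def ttc_minutes(self) -> float:
        return (self.contained - self.first_malicious_activity).total_seconds() / 60.0


if __name__ == "__main__":
    allowlist = {"svc-invoice-api": {64496}}
    flows = [
        Flow("svc-invoice-api", 64496, 12_000),
        Flow("svc-invoice-api", 64511, 5_000_000),
    ]
    for f in unfamiliar_egress(flows, allowlist):
        print(f"Unusual egress from {f.service} to unfamiliar ASN {f.dest_asn}")

    incident = Timeline(
        first_malicious_activity=datetime(2025, 1, 7, 10, 0),
        detected=datetime(2025, 1, 7, 10, 6),
        contained=datetime(2025, 1, 7, 10, 23),
    )
    print("TTD minutes:", incident.ttd_minutes, "TTC minutes:", incident.ttc_minutes)
```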
Are your most important security SLOs as visible to executives as your availability SLOs, and do they drive real trade-offs the way error budgets do?
If an attacker had 51 seconds from foothold to lateral movement, which single telemetry signal would tip you off first—and who would see it? (CrowdStrike)
When your help desk receives a “locked-out” call from an executive at 02:00, what phishing-resistant check makes the unsafe action impossible? (CISA)
Which top KEV item would you remove from exposure this week if you could only pick one—and what stops you from doing it tomorrow? (CISA)
If your cloud database were encrypted by noon, how many hours—human-hours, not wall-clock—would a clean restore actually take today?
Cybersecurity is not a horror story or a victory lap. It’s the reliability work you already believe in, with an adversary on the other side. The DBIR’s uncomfortable trends and IBM’s encouraging cost curve can both be true. The trick is to use SRE’s superpower—measurable, rehearsed, human-centered operations—to keep shrinking detection and containment times, to make “unsafe” workflows impossible, and to turn regulation into leverage rather than drag.
When the next incident comes—and it will—your customers won’t ask whether it was “security” or “reliability.” They’ll ask how quickly you noticed, how calmly you recovered, and what you changed so it doesn’t happen again. Let’s build for that.
Verizon, “2025 Data Breach Investigations Report – Executive Summary,” https://www.verizon.com/business/resources/reports/2025-dbir-executive-summary.pdf
Verizon, “2025 Data Breach Investigations Report,” landing page, https://www.verizon.com/business/resources/reports/dbir/
CrowdStrike, “2025 Global Threat Report,” press page, https://www.crowdstrike.com/en-us/press-releases/crowdstrike-releases-2025-global-threat-report/
CrowdStrike, “2025 Global Threat Report,” overview, https://www.crowdstrike.com/en-us/global-threat-report/
IBM, “Cost of a Data Breach Report 2025,” https://www.ibm.com/reports/data-breach
Mandiant (Google Cloud), “M-Trends 2025: Executive Edition,” https://services.google.com/fh/files/misc/m-trends-2025-executive-edition-en.pdf
Mandiant (Google Cloud), “M-Trends 2025 Report (full),” https://services.google.com/fh/files/misc/m-trends-2025-en.pdf
Sophos, “Active Adversary Report 2025,” press release summary, https://www.channelinsider.com/security/sophos-active-adversary-2025/
ENISA, “Threat Landscape 2024,” https://www.enisa.europa.eu/publications/enisa-threat-landscape-2024
NIST, “The NIST Cybersecurity Framework (CSF) 2.0,” Feb. 26, 2024, https://nvlpubs.nist.gov/nistpubs/CSWP/NIST.CSWP.29.pdf
NIST, “SP 800-61 Rev. 3: Incident Response Recommendations and Considerations for Cybersecurity Risk Management,” Apr. 3, 2025, https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-61r3.pdf
European Commission, “NIS2 Directive: securing network and information systems,” https://digital-strategy.ec.europa.eu/en/policies/nis2-directive
ESMA, “Digital Operational Resilience Act (DORA),” https://www.esma.europa.eu/esmas-activities/digital-finance-and-innovation/digital-operational-resilience-act-dora
EIOPA, “DORA – entered into application on 17 Jan 2025,” https://www.eiopa.europa.eu/digital-operational-resilience-act-dora_en
CISA, “Known Exploited Vulnerabilities (KEV) Catalog,” https://www.cisa.gov/known-exploited-vulnerabilities-catalog
CISA, “Implementing Phishing-Resistant MFA (fact sheet),” https://www.cisa.gov/sites/default/files/publications/fact-sheet-implementing-phishing-resistant-mfa-508c.pdf
Google SRE Book, “Postmortem Culture: Learning from Failure,” https://sre.google/sre-book/postmortem-culture/
Google SRE Workbook, “Implementing SLOs,” https://sre.google/workbook/implementing-slos/
Europol, “Law enforcement disrupt world’s biggest ransomware operation,” Operation Cronos (LockBit), https://www.europol.europa.eu/media-press/newsroom/news/law-enforcement-disrupt-worlds-biggest-ransomware-operation
U.S. DOJ, “U.S. and U.K. Disrupt LockBit Ransomware Variant,” Feb. 20, 2024, https://www.justice.gov/archives/opa/pr/us-and-uk-disrupt-lockbit-ransomware-variant
AP News, “Ransomware group LockBit is disrupted by a global police operation,” https://apnews.com/article/0297653ddfc245fcdf7d9308c6c1e6fe
ITPro, “Ransomware victims are refusing to play ball… just 17% have paid so far in 2025,” https://www.itpro.com/business/business-strategy/ransomware-victims-are-refusing-to-play-ball-with-hackers-just-17-percent-of-enterprises-have-paid-up-so-far-in-2025-marking-an-all-time-low
Axios, “Caesars Entertainment is latest casino chain to confirm it was hit by a cyberattack,” https://www.axios.com/2023/09/14/caesars-entertainment-is-latest-casino-chain-to-confirm-it-was-hit-by-a-cyberattack
Specops, “MGM Resorts: How hackers hit jackpot with service desk social engineering,” https://specopssoft.com/blog/mgm-resorts-service-desk-hack/
#SRE #SiteReliability #DEVOPS #Cybersecurity #Resilience #NIS2 #DORA #ZeroTrust #Ransomware #IncidentResponse #SecurityChaosEngineering #DevSecOps #Observability #PhishingResistantMFA #KEV