Why Your DevOps Isn't Reliable

Created on 2025-07-02 09:15

Published on 2025-07-02 12:30

Lecture: Human Factors in DevOps Reliability

Takeaways from the talk I gave at SREday on 27 June


The video can be found at: https://www.youtube.com/watch?v=M-9eDdoSVDc&list=PLPlIqkCCFmam6XvtE3oIC2l4dBFMKSjjJ

This lecture explores the critical role of human behaviour, communication, and team culture in achieving system reliability within DevOps environments. It emphasises that reliability is shaped not only by technical tools but also by proactive ownership, effective communication, robust processes, and a culture of learning and trust. Real-world examples illustrate how failures often stem from human factors, highlighting the importance of blameless postmortems and organisational support.

Takeaways

  1. Human behaviour is central to system reliability, not just technical tools.

  2. Communication failures are a leading cause of incidents in DevOps.

  3. Process and documentation are protective, not bureaucratic obstacles.

  4. Ownership and curiosity within teams drive proactive reliability.

  5. Learning and regular training are essential for reliable teams.

  6. Incident response reveals true team dynamics and culture.

  7. Trust is foundational; loss of trust undermines reliability.

  8. Well-managed change is protection, not bureaucracy.

  9. Monitoring and metrics require active, team-based engagement.

  10. Blameless postmortems must address both technical and human factors.

Highlights

Chapters & Topics

Human Behaviour and Reliability

Reliability in systems is not solely determined by technical measures such as SLOs, SLEs, error budgets, or dashboards. The behaviour, decisions, and communication of the people operating and developing the systems are the real drivers behind reliability. Human actions during incidents, under pressure, and in routine operations shape the actual reliability outcomes.

The lecture emphasises that behind every uptime number is a human story. Teams make judgement calls, sometimes under stress or fatigue, which directly impact reliability. The speaker shares personal experience and industry patterns where communication lapses or rushed decisions led to outages. The focus is on fostering a culture where people feel responsible, communicate openly, and learn from mistakes.

A critical outage occurred late on a Friday when a developer (referred to as 'Dev') pushed a hotfix to production without telling anyone, because Slack was quiet at the time. The result was production downtime.

Communication as a Pillar of Reliability

Effective communication is not a 'soft skill' but a critical operational defence in reliability engineering. Many incidents stem from communication failures, such as unannounced API changes or skipped updates. Communication should be treated as a first-class component of operational processes.

The speaker argues that communication is often undervalued in technical teams, seen as an afterthought rather than a core activity. Real-world examples show that missing a Slack message or failing to update a ticket can lead to cascading failures. Teams should normalise handovers, status updates, and verification steps.

A system failed because someone changed an API endpoint further down the line without informing others. There was no ticket, no Slack message, and no stand-up update.

Process, Documentation, and Change Management

Processes such as code reviews, change control, and testing are not bureaucratic obstacles but essential protections against reliability risks. Skipping steps in the name of speed increases the likelihood of incidents. Well-managed change is a safeguard, not red tape.

The lecture highlights that incidents often begin with small shortcuts, such as skipping a test or leaving documentation unwritten. These shortcuts accumulate and eventually lead to failures. The speaker stresses that process exists to keep teams from stepping on landmines, not to slow down progress. Teams should view process as protection, not an enemy.

The speaker recounts deploying a change on a Friday afternoon, skipping some process steps. Over the weekend, issues arose, demonstrating the risks of bypassing process for speed.

Ownership and Team Culture

True ownership means team members feel responsible for the product beyond assigned tickets. Teams with strong ownership proactively improve systems, anticipate issues, and respond calmly to incidents. Culture is shaped by how teams handle incidents, learn from failures, and support each other.

The lecture highlights that ownership is more than having a ticket assigned; it is about caring deeply for the product and anticipating problems. Teams that foster curiosity and continuous learning are better prepared for incidents. Training is considered a core part of SRE work, essential for keeping teams up to date.

A team deliberately broke their system in the acceptance environment every Friday to learn from failures. This practice reduced their mean time to recovery and increased confidence during real incidents.
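To see whether drills like these are paying off, a team can track mean time to recovery over time. The snippet below is a minimal sketch, assuming each drill or incident is recorded as a pair of detection and resolution timestamps. The function name `mean_time_to_recovery` and all sample data are made up for illustration and do not come from the talk.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average (resolved - detected) over a list of (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical Friday drill records: (detected, resolved).
early_drills = [
    (datetime(2025, 1, 3, 14, 0), datetime(2025, 1, 3, 15, 30)),
    (datetime(2025, 1, 10, 14, 0), datetime(2025, 1, 10, 15, 0)),
]
recent_drills = [
    (datetime(2025, 6, 6, 14, 0), datetime(2025, 6, 6, 14, 20)),
    (datetime(2025, 6, 13, 14, 0), datetime(2025, 6, 13, 14, 40)),
]

print("early drills: ", mean_time_to_recovery(early_drills))   # 1:15:00
print("recent drills:", mean_time_to_recovery(recent_drills))  # 0:30:00
```

The number itself matters less than the conversation it starts: a drop in recovery time after a series of drills is a signal worth discussing in the team, not a target to chase.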

Monitoring, Metrics, and Blameless Postmortems

Technical monitoring is necessary but insufficient without active engagement. Teams must regularly review and question their metrics, logs, and alerts. Blameless postmortems should address both technical and human factors, fostering a culture of learning rather than blame.

The speaker warns against complacency with monitoring tools, noting that teams often ignore recurring alerts or errors. Regular, team-based reviews are necessary. Postmortems should go beyond technical analysis to explore team dynamics and decision-making. Blame shuts down learning; openness encourages improvement.

Teams often receive alerts that are routinely ignored because they are seen as unimportant. This leads to missed issues and undermines reliability.
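One way to make a team-based alert review concrete is to list the alerts that fire often but are rarely acknowledged. The following is a rough sketch, assuming the alerting tool can export events as pairs of alert name and acknowledged flag. The function `ignored_alert_report`, its thresholds, and the alert names are hypothetical and should be adapted to whatever the team's tooling actually provides.

```python
from collections import Counter


def ignored_alert_report(events, min_fired=5, max_ack_ratio=0.2):
    """Return (name, fired, ack_ratio) for alerts that fire often but are rarely acknowledged.

    events: iterable of (alert_name, acknowledged) pairs.
    """
    fired = Counter()
    acked = Counter()
    for name, acknowledged in events:
        fired[name] += 1
        if acknowledged:
            acked[name] += 1

    report = []
    for name, count in fired.items():
        ratio = acked[name] / count
        if count >= min_fired and ratio <= max_ack_ratio:
            report.append((name, count, ratio))
    # Noisiest alerts first, so the review starts with the biggest offenders.
    report.sort(key=lambda row: row[1], reverse=True)
    return report


# Hypothetical export of one week of alert events.
events = (
    [("disk_usage_warning", False)] * 9
    + [("disk_usage_warning", True)]
    + [("api_5xx_rate", True)] * 4
    + [("cert_expiry_soon", False)] * 6
)

for name, count, ratio in ignored_alert_report(events):
    print(f"{name}: fired {count} times, acknowledged {ratio:.0%}")
```

Each alert on the resulting list deserves a decision in the review: fix the underlying cause, tune the threshold, or delete the alert. Leaving it to keep firing unnoticed is the outcome the talk warns against.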

Organisational and Team Culture for Reliability

Sustained reliability requires both team-level and organisational support for a culture of openness, communication, and continuous improvement. Teams with a strong, stable core can maintain culture even under organisational pressure, but broader support accelerates progress.

The speaker shares experience from ING, where some teams have achieved a strong reliability culture, but the broader organisation may not always support it. Teams with a stable core of senior members can maintain culture, but change is slow. Openness and communication are key values in hiring and team building. Reliability lives in the team and people, not just the technical stack. The true system includes human behaviour under pressure.

At ING, some teams have developed a strong culture of reliability, but this is often limited to the team level rather than the whole organisation. A stable core of senior members helps maintain this culture.