Why Your DevOps Isn't Reliable

Lecture: Human Factors in DevOps Reliability

Takeaways from the talk I gave at SREDAY on 27 June

The video can be found at: https://www.youtube.com/watch?v=M-9eDdoSVDc&list=PLPlIqkCCFmam6XvtE3oIC2l4dBFMKSjjJ

This lecture explores the critical role of human behaviour, communication, and team culture in achieving system reliability within DevOps environments. It emphasises that reliability is shaped not only by technical tools but also by proactive ownership, effective communication, robust processes, and a culture of learning and trust. Real-world examples illustrate how failures often stem from human factors, highlighting the importance of blameless postmortems and organisational support.

Takeaways

Human behaviour is central to system reliability, not just technical tools.
Communication failures are a leading cause of incidents in DevOps.
Process and documentation are protective, not bureaucratic obstacles.
Ownership and curiosity within teams drive proactive reliability.
Learning and regular training are essential for reliable teams.
Incident response reveals true team dynamics and culture.
Trust is foundational; loss of trust undermines reliability.
Well-managed change is protection, not bureaucracy.
Monitoring and metrics require active, team-based engagement.
Blameless postmortems must address both technical and human factors.

Highlights

"Reliability isn't just what your tools tell you. It is what your people do when the tools break." -- Marcel Koert
"Every step you skip in the name of speed is a bet against reliability." -- Marcel Koert
"Process isn't the enemy, it's protection." -- Marcel Koert
"Ownership that is felt by the people that work on it, people that want to take a step more, they want to do things, they want to improve their product." -- Marcel Koert
"Reliable teams are learning teams." -- Marcel Koert
"When trust is gone, reliability is gone." -- Marcel Koert
"Reliability doesn't only live in your technical stack, it also lives in your team and in your people." -- Marcel Koert
"At the end of the day, your system isn't just what runs in production." -- Marcel Koert

Chapters & Topics

Human Behaviour and Reliability

Reliability in systems is not solely determined by technical measures such as SLOs, SLEs, error budgets, or dashboards. The behaviour, decisions, and communication of the people operating and developing the systems are the real drivers behind reliability. Human actions during incidents, under pressure, and in routine operations shape the actual reliability outcomes.

Keypoints
Incidents often result from human decisions, not just technical faults.
Judgement calls under pressure (e.g., at 3 a.m.) can make or break reliability.
Ownership, curiosity, and proactive behaviour are essential for reliable teams.
Team dynamics during incidents reveal the true state of reliability.
Explanation

The lecture emphasises that behind every uptime number is a human story. Teams make judgement calls, sometimes under stress or fatigue, which directly impact reliability. The speaker shares personal experience and industry patterns where communication lapses or rushed decisions led to outages. The focus is on fostering a culture where people feel responsible, communicate openly, and learn from mistakes.

Examples

A critical outage occurred late on a Friday when a developer (referred to as 'Dev') pushed a hotfix to production without telling anyone, because Slack was quiet. Production went down as a result.

Dev was on call and made a change without informing the team.
Lack of communication meant no one was aware of the change.
Production went down, illustrating how human behaviour, not just technical issues, causes outages.
Considerations
Encourage open communication, especially around changes.
Foster a culture of ownership and responsibility.
Recognise the impact of stress and fatigue on decision-making.
Special Circumstances
If a team member makes changes late in the week or outside normal hours (e.g., Friday afternoon or evening), ensure there is a clear communication protocol to prevent silent failures.

Communication as a Pillar of Reliability

Effective communication is not a 'soft skill' but a critical operational defence in reliability engineering. Many incidents stem from communication failures, such as unannounced API changes or skipped updates. Communication should be treated as a first-class component of operational processes.

Keypoints
Lack of communication can cause multi-million-dollar system failures.
Communication prevents incidents by ensuring alignment and awareness.
Speaking up, verifying, and showing work are preventive actions.
Communication is not optional or secondary to technical work.
Explanation

The speaker argues that communication is often undervalued in technical teams, seen as an afterthought rather than a core activity. Real-world examples show that missing a Slack message or failing to update a ticket can lead to cascading failures. Teams should normalise handovers, status updates, and verification steps.

Examples

A system failed because someone changed an API endpoint further down the line without informing others. There was no ticket, no Slack message, and no stand-up update.

The change was made in isolation.
Lack of communication led to downstream failures.
The incident could have been prevented with a simple update or announcement.
Considerations
Treat communication as a core part of reliability, not an afterthought.
Implement regular handovers and status updates.
Encourage team members to verify and clarify changes.
Special Circumstances
If a change is made that could impact other teams or systems, always communicate proactively, even if it seems minor.
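
One way to make proactive announcements hard to skip is to give them a fixed shape. The following is a minimal sketch, not something shown in the talk: the `ChangeNotice` record, its fields, and the helper functions are hypothetical names chosen for illustration, and where the resulting text is posted (Slack, a ticket, stand-up notes) is left open.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeNotice:
    """Hypothetical record of a change that other people should hear about."""
    author: str
    summary: str
    affected_systems: list[str] = field(default_factory=list)
    rollback_plan: str = ""


def missing_fields(notice: ChangeNotice) -> list[str]:
    """Return the names of the fields that are still empty."""
    missing = []
    if not notice.summary.strip():
        missing.append("summary")
    if not notice.affected_systems:
        missing.append("affected_systems")
    if not notice.rollback_plan.strip():
        missing.append("rollback_plan")
    return missing


def format_notice(notice: ChangeNotice) -> str:
    """Render the notice as plain text; refuse if it is incomplete."""
    gaps = missing_fields(notice)
    if gaps:
        raise ValueError(f"notice incomplete, missing: {', '.join(gaps)}")
    systems = ", ".join(notice.affected_systems)
    return (
        f"CHANGE by {notice.author}: {notice.summary}\n"
        f"Affects: {systems}\n"
        f"Rollback: {notice.rollback_plan}"
    )
```

The point of the sketch is that even a "minor" change forces the author to name who is affected and how to undo it before anything is sent.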

Process, Documentation, and Change Management

Processes such as code reviews, change control, and testing are not bureaucratic obstacles but essential protections against reliability risks. Skipping steps in the name of speed increases the likelihood of incidents. Well-managed change is a safeguard, not red tape.

Keypoints
Process steps catch mistakes that individuals may miss.
Skipping documentation or tests is a risk to reliability.
Change management builds trust and prevents chaos.
Process debt is as dangerous as technical debt.
Explanation

The lecture highlights that incidents often begin with small shortcuts, such as skipping a test or documentation. These shortcuts accumulate and eventually lead to failures. The speaker stresses that process is about preventing landmines, not slowing down progress. Teams should view process as protection, not an enemy.

Examples

The speaker recounts deploying a change on a Friday afternoon, skipping some process steps. Over the weekend, issues arose, demonstrating the risks of bypassing process for speed.

Deployment was rushed to finish before the weekend.
Process steps were skipped, leading to undetected issues.
Problems surfaced later, requiring urgent fixes.
Considerations
Never skip code reviews, change control, or testing.
View process as a safety net, not a hindrance.
Document changes promptly and thoroughly.
Special Circumstances
If under pressure to deploy quickly, pause and ensure all process steps are followed to avoid future incidents.
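
The "pause and check" advice can be turned into a small gate that runs before a deploy. This is only a sketch under stated assumptions: the step names in `REQUIRED_STEPS` mirror the three protections named in this section, while the function names and the idea of tracking completed steps as a set are illustrative choices, not a description of any particular pipeline.

```python
# Hypothetical step names; a real team would define its own list.
REQUIRED_STEPS = ("code_review", "change_control", "tests")


def skipped_steps(completed: set[str]) -> list[str]:
    """Return the required steps that have not been completed, in order."""
    return [step for step in REQUIRED_STEPS if step not in completed]


def may_deploy(completed: set[str]) -> bool:
    """Allow a deploy only when no required step was skipped."""
    return not skipped_steps(completed)


if __name__ == "__main__":
    # Friday-afternoon scenario: change control was skipped to save time.
    done = {"code_review", "tests"}
    print(skipped_steps(done))  # ['change_control']
    print(may_deploy(done))     # False
```

A gate like this does not replace judgement, but it makes a skipped step visible at the moment the shortcut is taken.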

Ownership and Team Culture

True ownership means team members feel responsible for the product beyond assigned tickets. Teams with strong ownership proactively improve systems, anticipate issues, and respond calmly to incidents. Culture is shaped by how teams handle incidents, learn from failures, and support each other.

Keypoints
Ownership is proactive, not just reactive.
Curiosity drives continuous improvement and learning.
Incident response reveals real team dynamics.
Training is an integral part of work, not an overhead.
Explanation

The lecture highlights that ownership is more than having a ticket assigned; it's about caring deeply for the product and anticipating problems. Teams that foster curiosity and continuous learning are better prepared for incidents. Training is considered a core part of SRE work, essential for keeping teams up-to-date.

Examples

A team deliberately broke their system in acceptance every Friday to learn from failures. This practice reduced their mean time to recovery and increased confidence during real incidents.

The team simulated failures regularly.
They learned from each incident, improving their response.
When real incidents occurred, they were calm and effective.
Considerations
Encourage proactive ownership and curiosity.
Support regular learning and experimentation.
Special Circumstances
If team members rotate in and out, a stable core of senior people is crucial for maintaining the desired culture and reliability.
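
The Friday failure drills above are credited with reducing mean time to recovery. A team that wants to see that trend for itself could record when each drill started and when service was restored, then average the durations. The sketch below assumes that representation; the function name is hypothetical and the timestamps in the usage note are invented illustration data, not figures from the talk.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(
    incidents: list[tuple[datetime, datetime]],
) -> timedelta:
    """Average of (recovered - started) over all recorded incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)
```

For example, two made-up drills lasting 40 and 20 minutes give `mean_time_to_recovery(drills) == timedelta(minutes=30)`. Comparing this number across months is one way to check whether the drills are paying off.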

Monitoring, Metrics, and Blameless Postmortems

Technical monitoring is necessary but insufficient without active engagement. Teams must regularly review and question their metrics, logs, and alerts. Blameless postmortems should address both technical and human factors, fostering a culture of learning rather than blame.

Keypoints
Monitoring must be actively maintained and reviewed by the team.
Reviewing metrics at regular intervals (e.g., every sprint) is essential.
Blameless postmortems encourage honesty and learning.
Addressing human factors in postmortems prevents future incidents.
Explanation

The speaker warns against complacency with monitoring tools, noting that teams often ignore recurring alerts or errors. Regular, team-based reviews are necessary. Postmortems should go beyond technical analysis to explore team dynamics and decision-making. Blame shuts down learning; openness encourages improvement.

Examples

Teams often receive alerts that are routinely ignored because they are seen as unimportant. This leads to missed issues and undermines reliability.

Recurring alerts become background noise.
Important signals may be missed.
Regular review and questioning of alerts are needed.
Considerations
Schedule regular team reviews of monitoring and metrics.
Ensure postmortems are truly blameless and address human factors.
Encourage honesty about mistakes and confusion.
Special Circumstances
If a team member is afraid to admit mistakes, reinforce the blameless nature of postmortems to encourage openness.
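
For the regular alert review described in this section, it can help to bring a short list of alerts that keep firing but that nobody ever acts on. The following is a rough sketch: the event format `(alert_name, was_acted_on)`, the function name, and the threshold of three fires are all assumptions made for illustration, not part of any monitoring tool's API.

```python
from collections import Counter


def noisy_alerts(events: list[tuple[str, bool]], min_fires: int = 3) -> list[str]:
    """Names of alerts that fired at least min_fires times and were never acted on.

    Each event is a pair (alert_name, was_acted_on).
    """
    fires = Counter(name for name, _ in events)
    acted = {name for name, acted_on in events if acted_on}
    return sorted(
        name
        for name, count in fires.items()
        if count >= min_fires and name not in acted
    )
```

The output is a starting point for the team conversation: each listed alert should either be fixed, tuned, or deleted, rather than left as background noise.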

Organisational and Team Culture for Reliability

Sustained reliability requires both team-level and organisational support for a culture of openness, communication, and continuous improvement. Teams with a strong, stable core can maintain culture even under organisational pressure, but broader support accelerates progress.

Keypoints
Team culture is critical, but organisational support is needed.
Reliable teams often have a core of senior members.
Cultural change is slow and requires persistence, often taking years.
Openness and communication are key values when hiring for cultural fit.
Reliability is not fixed by adding complexity (more tools or dashboards) but by reducing it and by building better conversations, habits, and team dynamics.
The system in production is not just technical components but also how people behave when things go wrong.
Explanation

The speaker shares experience from ING, where some teams have achieved a strong reliability culture, but the broader organisation may not always support it. Teams with a stable core of senior members can maintain culture, but change is slow. Openness and communication are key values in hiring and team building. Reliability lives in the team and people, not just the technical stack. The true system includes human behaviour under pressure.

Examples

At ING, some teams have developed a strong culture of reliability, but this is often limited to the team level rather than the whole organisation. A stable core of senior members helps maintain this culture.

Teams resist organisational pressure to cut corners.
A core group maintains standards and culture.
New members rotate in and out, but the core persists.
Considerations
Support team-level culture with organisational policies.
Hire for openness and communication skills.
Recognise that cultural change takes years, not months.
Focus on reducing complexity and fostering better human interactions rather than just adding more tools.
Special Circumstances
If organisational pressure undermines team culture, empower teams to stand up for reliability practices.
Internal team members may find it harder to take responsibility for failures because of pressure, but a healthy team culture makes this openness possible. External consultants may find it easier to be open about failures.