Random SRE Thoughts | Sensors and Signals

This is a random collection of thoughts I have around Site Reliability Engineering and adjacent topics. I recently moved out of SRE back into a SWE role, and wanted to write down some of these thoughts while they are still fresh. They are informed by my personal experiences and are often just streams of thought.

Should an SRE be a programmer?

A common debate in forums and groups surrounding SRE is how important software engineering skills and programming knowledge are for people in the role.

In my previous career as a music producer, I learned that you do not necessarily need to know how to play any instruments, but it helps. The same can be said of SRE, as there are undoubtedly situations where having experience building software and writing code benefits an SRE collaborating with software engineers, either during the SDLC or when battling incidents. Many producers who are also musicians can go beyond just providing feedback and actually pick up an instrument and contribute to a song, or make recommendations on different ways to play or change a part. Similarly, an SRE who is comfortable reading and writing code might be able to point out optimizations, interpret code to disambiguate system interactions, or identify patterns they have encountered while working with other teams solving similar problems.

It’s a hard question to answer definitively, and I think it depends on the organization and the role of the SRE within it. With that being said, it’s hard to think of a scenario where an SRE would not benefit from having some programming skills, even if they are not the primary developer of the services running in production.

Rorschach Tests - What do you see?

There are some people in the observability industry who talk about how dashboards are becoming a tool of the past, but as an SRE, I always found them invaluable. When you spend day after day building and interpreting dashboards for a system, you build a visual intuition, or at least I did. The ability to not only interpret a single graph, but piece together what multiple signals mean in combination, can be the difference between resolving an incident in minutes or hours. Some may argue that LLMs will be able to do all of this for us, but even if that is true, it’s a good idea to have a way to “trust but verify”.

I used to capture interesting “scenes” from dashboards after incidents and turn them into training exercises similar to “Rorschach tests”. I would show them to engineers and ask them “what do you see?”. Being able to look at a series of graphs, with just a small amount of context about the services depicted, and tell a story about what happened is a must-have skill for SREs (in my opinion).

When you are explaining a root cause hypothesis amid an incident, or proposing an operational move to remediate a situation, you need to be able to communicate clearly. Using telemetry data in visual dashboards has always been the easiest way to do this (for me). I would encourage my teams to do weekly exercises where we just explored dashboards and signals, practiced interpreting them, and looked for “blips” of interest to dig into and understand. It’s an interesting group exercise, as different people will have different perspectives and see different signals amongst the chaos.

Cowboy

Sometimes when I think about my previous roles as an SRE, my mental visualization is reminiscent of a cowboy. I don’t mean the kind of “cowboy” describing someone who is reckless or careless, and YOLOs code into production without the proper vetting, but instead someone who tends and herds cattle or horses.

I have a friend who worked at a barn. “Cowboy” wasn’t his official title, but that’s what he was. He was responsible for over 20 horses of various ages and sizes at any given time. They all had different personalities and dynamics when they interacted with each other, often influenced by environmental factors. It was not uncommon for new horses to join the stable for a season or two and become part of the herd.

While each horse was trained and raised by a different person who had an intimate bond and understanding of the animal, they might not understand what happens daily when that horse interacts with the others in the new environment. Over the years, my friend observed dozens of horses come and go, and how they fit in, or didn’t. He spent long days with them, picking up on their traits and differences, how they communicated, knowing how to spot when they were sick or injured, and how to rehabilitate them and keep them healthy and in shape. If one of them was slightly “off” in some way, perhaps a limp, or a lack of appetite, he would notice that sort of thing.

As a site reliability engineer, I spent my days among the services and infrastructure in production, where there were similar dynamics of new software or infrastructure coming and going with constantly changing environmental factors. Over time, you similarly pick up on the personalities and signals that indicate when things are out of balance or at risk, and how to shepherd things back to health. When the baseline of a system is well understood, it becomes easier to identify when something is “off”.

There were many occasions where I would work with a team and bring something to their attention that they hadn’t noticed because it was outside the purview of their direct service, or a change so small that it didn’t trip any SLOs… yet. It wasn’t that they didn’t know how to monitor their components, but sometimes slight changes in one area can have unpredictable cascading effects that might be slow to bubble up. Good SREs will build a “spidey sense” about these types of things, and can help keep the herd safe, and the wolves at bay.

Incident Radiation

There is a certain amount of stress and anxiety involved with incidents, especially those of high severity. In a way, it seems similar to radiation, where small exposures here and there are of little consequence, but daily exposure will result in mutations of your cells. After a while, you become a different thing. You experience a sort of trauma and the notion of another incident possibly happening can trigger a panic attack. It becomes hard to focus on anything when you get to the point of constantly expecting disruptions.

Why do we get stressed by incidents? Why do we allow ourselves to care so much to the point of such emotional and even physical damage?

Do we worry about the survival of the companies we work for and fear for job security when things break? I can’t recall many stories of an incident taking down a company.

Is it because the “cleanup” or remediation is going to add a lot more work to your plate, and you don’t have the bandwidth for it? It can be challenging to plan roadmaps and estimate capacity for SRE teams due to the unpredictable nature of the job and the constant re-prioritizing.

Is it because we want to avoid the emotional embarrassment and shame that goes along with an incident? That it’s somehow our fault that an incident happened at all? Most of the time SREs have limited involvement with the root cause of these issues, but SREs are usually evaluated on, and have team goals surrounding, the reduction of incidents in some form or another. While this type of goal might seem logical to a leader, I think it can result in low morale if there are many incidents happening out of the team’s control, yet SRE is still held accountable for them. While an SRE team’s job should be to reduce incidents, and identify systematic ways to prevent them, depending on the organization it might not be reasonable to expect them to be able to do that from a bottom-up approach alone.

It is important that leaders foster the right culture and set appropriate goals and KPIs around downtime. For example, instead of setting a goal of “no incidents”, which is unrealistic and can be demoralizing to miss, a better goal might be to focus on the team’s ability to respond to incidents (MTTD/MTTR), and how quickly they can right the ship in the face of challenges. This is something that is more in the control of the team and can be measured and improved upon. Instead of dreading the occurrence of an incident, the team can focus on honing their skills and improving their response times, closing gaps in monitoring, and innovating on self-healing auto-remediation strategies.
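As a rough sketch of what tracking these numbers could look like, here is a small Python example. It assumes each incident record carries three timestamps (impact start, detection, resolution), and it reads MTTR as "mean time to resolve", measured from the start of impact. Both are assumptions for illustration: the `Incident` class and its field names are hypothetical, and teams define MTTR in different ways (repair, recovery, respond).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record shape; field names are assumptions, not from any tool.
    started_at: datetime   # when customer impact began
    detected_at: datetime  # when an alert or a person noticed it
    resolved_at: datetime  # when impact ended


def mean_time_to_detect(incidents: list[Incident]) -> timedelta:
    """Average gap between impact start and detection (MTTD)."""
    gaps = [(i.detected_at - i.started_at).total_seconds() for i in incidents]
    return timedelta(seconds=mean(gaps))


def mean_time_to_resolve(incidents: list[Incident]) -> timedelta:
    """Average gap between impact start and resolution (one reading of MTTR)."""
    gaps = [(i.resolved_at - i.started_at).total_seconds() for i in incidents]
    return timedelta(seconds=mean(gaps))


# Made-up example data, purely for illustration.
incidents = [
    Incident(datetime(2024, 1, 8, 10, 0), datetime(2024, 1, 8, 10, 5), datetime(2024, 1, 8, 10, 45)),
    Incident(datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 15), datetime(2024, 1, 9, 15, 15)),
]

print("MTTD:", mean_time_to_detect(incidents))   # 0:10:00
print("MTTR:", mean_time_to_resolve(incidents))  # 1:00:00
```

The arithmetic is the easy part. The hard part is agreeing on what counts as "started", "detected", and "resolved", and recording those timestamps consistently, which is where most of the measurement complexity tends to come from.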

I have seen posts about how measuring MTTR is not a valuable metric because of the complexity involved in getting accurate measurements, but I don’t know if I necessarily agree.