Articles

Modern systems and applications span numerous architectures and technologies — they are also becoming increasingly more dynamic, distributed, and modular in nature. In order to support the availability and performance of their systems, IT operations and SRE teams need advanced monitoring capabilities. This Refcard reviews the four distinct levels of observability maturity, key functionality at each stage, and next steps organizations should take to enhance their monitoring practices.
Source de l’article sur DZONE

The dreaded part of every site reliability engineer’s (SRE) job eventually: capacity planning. You know, the dance between all the stakeholders when deploying your applications. Did engineering really simulate the right load and do we understand how the application scales? Did product managers accurately estimate the amount of usage? Did we make architectural decisions that will keep us from meeting our SLA goals? And then the question that everyone will have to answer eventually: how much is this going to cost? This forces SREs to assume the roles of engineer, accountant, and fortune teller.

The large cloud providers understood this a long time ago and so the term “cloud economics” was coined. Essentially this means: rent everything and only pay for what you need. I would say this message worked because we all love some cloud. It’s not a fad either. SREs can eliminate a lot of the downside when the initial infrastructure capacity discussion was maybe a little off. Being wrong is no longer devastating. Just add more of what you need and in the best cases, the services scale themselves — giving everyone a nice night’s sleep. All this without provisioning a server, which gave rise to the term “serverless.”

Source de l’article sur DZONE

If you’re an SRE, you might view AIOps with great excitement. By automating complex workflows and troubleshooting processes, AIOps could make your life as an SRE much easier.

Alternatively, SREs may choose to view AIOps with disdain. They might think of AIOps as just a fancy buzzword that doesn’t live up to its promises, and that can become a distraction from the SRE tools that really matter.

Source de l’article sur DZONE

When are you smarter than your playbooks, and when are your playbooks smarter than you?

That’s a question that engineers rarely step back to consider. The rational, disciplined parts of our minds tell us that the playbooks we are supposed to follow were carefully designed and tested and that we should stick to them at all costs.

Source de l’article sur DZONE

The application development landscape has fundamentally changed in recent years. In a recent interview with Ambassador Labs, Mario Loria from CartaX said he believes this is still uncharted territory, particularly for developers in the cloud-native space. As he sees it, site reliability engineers (SREs) play a key role in guiding developers through the learning curve toward comprehensive self-service of the supporting platforms and ecosystem, and ultimately to service ownership. This requires a major shift in company and management culture, and developer (and SRE) mindset and tooling as well as insight to make the journey to full lifecycle ownership not just smoother and more transparent but also technically feasible.

Two Worlds Colliding: The Monolith and Service-Oriented Architecture

The traditional monolith continues to exist in parallel with cloud-native application development. The operations side of the equation, according to Mario, understands that this has caused a big shift in deploying, releasing, and operating applications, and now the role of SREs is to help developers understand and own this shift. Developers know how to code, but building in the necessary understanding (and ownership) of the “ship” and “run” aspects of the lifecycle introduces a steep learning curve. For developers, this means taking on new responsibilities with the support of SREs.

Source de l’article sur DZONE

The monitoring and alerting stack is a crucial part of the SRE practices. That’s where BotKube helps you monitor your Kubernetes cluster and send notifications to your messaging platform or any other configured sink. In this blog post, we will be configuring BotKube to watch the Kubernetes cert-manager certificates CustomResources.

What is BotKube?

BotKube is a messaging tool for monitoring and debugging Kubernetes clusters. BotKube can be integrated with multiple messaging platforms like – Slack, Mattermost, or Microsoft Teams to help you monitor your Kubernetes cluster(s), debug critical deployments, and gives recommendations for standard practices by running checks on the Kubernetes resources.

Source de l’article sur DZONE

In the software industry’s recent past, the biggest disruptive wave was Agile methodologies. While Site Reliability Engineering is still early in its adoption, those of us who experienced the disruptive transformation of Agile see the writing on the wall: SRE will impact everyone.

Any kind of major transformation like this requires a change in culture, which is a catch-all term for changing people’s principles and behaviors. As your organization grows, this will extend beyond product and engineering. At some point you also need to convince the key power-holders in your organization to invest in this transformation.

Source de l’article sur DZONE


‘Learn from the mistakes of others. You can’t live long enough to make them all yourself’ – Eleanor Roosevelt.

Nobody is immune from outages but it’s better to learn from other’s mistakes than from your own. The second quarter of 2020 was marked by several serious outages at prominent services including IBM Cloud, GitHub, Slack, Zoom and even T-Mobile (Source: StatusGator Report). I’m sure you noticed these outages like our team did. I decided to share the lessons we learned from this downtime, hoping we can all grow from it.

Source de l’article sur DZONE

If you’ve spent any time in tech circles lately, there are three letters you’ve surely heard: SRE. Site Reliability Engineering is the defining movement in tech today. Giants like Google and Amazon market their ability to provide reliable service and startups are now investing in reliability as an early priority.

But what makes reliability engineering so important? In this blog, we’ll look at three big benefits of investing in reliability and explain how you can get started on your journey to reliability excellence.

Source de l’article sur DZONE


Abstract

Cloud-native applications are a type of complex system that depends on the continuous effort of software professionals that combines the best of their expertise to keep them running. In other words, their reliability isn’t self-sustaining, but is a result of the interactions of all the different actors engaged in their design, build, and operation.

Over the years the collection of those interactions has been evolving together with the systems they were designed to maintain, which have been also becoming increasingly sophisticated and complex. The IT service management model, once designed to maintain control and stability, is now fading and giving place to a model designed to improve velocity while maintaining stability. Although the combination of those things might seem contradictory at first, this series of articles tries to reveal the reasons why the collection of practices that today we know as DevOps and SRE (Site Reliability Engineering) are becoming the norm for modern systems.

Source de l’article sur DZONE