Architecture Weekly Issue #125. Articles, books, and playlists on architecture and related topics. Split by sections, highlighted with complexity: 🤟 means hardcore, 👷‍♂️ is technically applicable right away, 🍼 is an introduction to the topic or an overview. Now on Telegram and Substack as well.


Kubernetes resiliency (RTO/RPO) in Multi-Cluster deployments 👷‍♂️

The main disaster recovery objectives are the Recovery Time Objective (how much time it takes to get back up) and the Recovery Point Objective (how much data we lose during the disaster). Kubernetes, with its StatefulSet feature, can be considered a tool for achieving zero RTO within a single cluster. Multi-cluster is a bigger challenge. Find out how to approach it in the article by Avesha.
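
To make the two objectives concrete, here is a minimal sketch that derives the achieved recovery time and the data-loss window from three timestamps of an incident. The function name and the timestamps are hypothetical and serve only as an illustration; they are not taken from the article.

```python
from datetime import datetime, timedelta
from typing import Tuple


def recovery_metrics(
    last_replicated: datetime, failure: datetime, restored: datetime
) -> Tuple[timedelta, timedelta]:
    """Return (recovery_time, data_loss_window) for a single incident.

    recovery_time is compared against the RTO,
    data_loss_window is compared against the RPO.
    """
    if not (last_replicated <= failure <= restored):
        raise ValueError("expected last_replicated <= failure <= restored")
    recovery_time = restored - failure            # how long we were down
    data_loss_window = failure - last_replicated  # writes made in this window are lost
    return recovery_time, data_loss_window


# Hypothetical incident: last successful replication at 11:55,
# failure at 12:00, service restored at 12:20.
recovery_time, data_loss_window = recovery_metrics(
    datetime(2024, 1, 1, 11, 55),
    datetime(2024, 1, 1, 12, 0),
    datetime(2024, 1, 1, 12, 20),
)
print("recovery time:", recovery_time)
print("data loss window:", data_loss_window)
```

An incident meets the objectives only if both values stay under the agreed RTO and RPO respectively.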

Kubernetes resiliency (RTO/RPO) in Multi-Cluster deployments
Ah Kubernetes! The panacea to all our DevOps challenges.

#k8s #kubernetes #resilience

Fault-tolerant Distributed Transactions Can be Fast and Simple 🤟

There is a new paper on Unanimous two-phase commit (U2PC), an alternative to the classic two-phase commit that Spanner, MongoDB, Yugabyte and others use in conjunction with Paxos. That pair requires 2N+1 nodes to tolerate N failures (so 3 nodes for a single node failure), while U2PC claims to require only N+1. Murat digs deep into the paper in his blog post.
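
The node counts quoted above are easy to compare with a bit of arithmetic. The sketch below only reproduces those two formulas; the function names are made up for illustration, and it says nothing about how either protocol actually works.

```python
def nodes_2pc_with_paxos(failures: int) -> int:
    """Nodes needed to tolerate `failures` crashes with 2PC plus Paxos (2N+1)."""
    return 2 * failures + 1


def nodes_u2pc(failures: int) -> int:
    """Nodes U2PC claims to need to tolerate `failures` crashes (N+1)."""
    return failures + 1


# Compare the two formulas for a few failure counts.
for n in (1, 2, 3):
    print(f"N={n}: 2PC+Paxos={nodes_2pc_with_paxos(n)}, U2PC={nodes_u2pc(n)}")
```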

Unanimous 2PC: Fault-tolerant Distributed Transactions Can be Fast and Simple
This paper (PAPOC′24) is a cute paper. It isn’t practical or very novel, but I think it is a good thought-provoking paper. It did bring to...


Making Impact as a Software Engineer 🍼

And the second piece of content from my side! I conducted a webinar on making an impact as a software engineer and on how career growth is connected to business impact. Find the whole video here!


Follow Up

How Grafana used Dapr to improve vulnerability scans 👷‍♂️

If you're building containers using 3rd party dependencies, you have to protect yourself from supply chain attacks. It might seem as easy as running a scan against a container, only to discover later that scale makes things complicated. You need to handle several sources of scanning requests, cache results for performance, and handle errors. Grafana tried to solve it using Dapr; find out what they designed.

How Grafana used Dapr to improve vulnerability scans
Grafana open source software empowers users to query, visualize, alert on, and explore metrics, logs, and traces, regardless of their storage location. Grafana OSS equips users with tools to transform…


Evolving the Backend Storage for Platform Metrics 👷‍♂️

Heroku stores and analyses metrics for its clients to enable features like auto-scaling. A whole subsystem called MetaaS, or Metrics-as-a-Service, is designed specifically for this task and includes Apache Kafka and Cassandra. While there were no issues with Kafka, operating a Cassandra cluster proved to be tedious. The team decided to migrate the data to DynamoDB and, a year later, concluded it was the right choice. Find the reasoning inside.

Evolving the Backend Storage for Platform Metrics
A peek behind the curtain at an infrastructure upgrade Heroku completed last year, migrating to a new-and-improved storage backend for platform metrics.

#casestudy #observability

QA myth busting: Quality can be measured 🍼

Testing NFRs is part of an architect's job, and another part is correctly interpreting the results. Setting a goal of low coupling or low cyclomatic complexity can screw over your product. Vitaly Sharovatov explores the nature of metrics in QA and explains the proper attitude to them in this longread.

QA myth busting: Quality can be measured
Move past the myth that quality can be definitively measured and learn how to use metrics as signals rather than finite goals.

#qa #quality

The Importance of Using a Composite Metric to Measure Performance 👷‍♂️

To follow up on the metrics, let's consider an example from Indeed: they walk through different scenarios of a client web page and demonstrate that you cannot possibly use a single metric to conclude whether the page was slow or not. Instead, you need a composite metric for a more holistic picture.
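
As a rough illustration of the idea, here is a minimal sketch of a composite score built as a weighted average of normalized per-metric scores. The metric names, thresholds, weights, and page measurements are all assumptions chosen for the example; this is not Indeed's formula.

```python
def metric_score(value: float, good: float, poor: float) -> float:
    """Map a raw measurement to [0, 1]: 1.0 at or below `good`, 0.0 at or above `poor`."""
    if value <= good:
        return 1.0
    if value >= poor:
        return 0.0
    return (poor - value) / (poor - good)  # linear interpolation in between


def composite_score(values: dict, thresholds: dict, weights: dict) -> float:
    """Weighted average of the per-metric scores."""
    total_weight = sum(weights.values())
    weighted = sum(
        weights[name] * metric_score(values[name], *thresholds[name])
        for name in weights
    )
    return weighted / total_weight


# Hypothetical (good, poor) thresholds and weights, for illustration only.
thresholds = {"lcp_ms": (2500, 4000), "tbt_ms": (200, 600), "cls": (0.1, 0.25)}
weights = {"lcp_ms": 0.4, "tbt_ms": 0.4, "cls": 0.2}

# Page A paints quickly but blocks the main thread; page B is the opposite.
page_a = {"lcp_ms": 2000, "tbt_ms": 600, "cls": 0.05}
page_b = {"lcp_ms": 3250, "tbt_ms": 200, "cls": 0.1}

score_a = composite_score(page_a, thresholds, weights)
score_b = composite_score(page_b, thresholds, weights)
print(f"page A: {score_a:.2f}, page B: {score_b:.2f}")
```

Judged by paint time alone, page A looks faster, yet its composite score is lower because the blocked main thread drags it down.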

The Importance of Using a Composite Metric to Measure Performance - Indeed Engineering Blog
The way we measure web performance has changed significantly over the last decade. As the web has modernized, it’s become more important than ever to refine how we measure performance, and we should be using a composite metric to ensure we’re capturing the full story around speed. Composite metrics…


Discord's Streaming Technology 👷‍♂️

Discord is a popular chat app with group call functionality. Discord is also closely tied to gaming, so streaming games on a call was a natural necessity. In their engineering blog they describe how the stream is coordinated between client apps and Discord servers, how WebRTC helps make it happen, and how the performance is measured.

How It All Goes Live: An Overview of Discord’s Streaming Technology
Discord’s Go Live feature provides high-quality screen share between users. Tag along with Audio/Video manager Josh Stratton as he explains how it works.

#casestudy #performance


The brutal and unjustified war against Ukraine has been going on for 2 years already. If you want to help Ukraine directly, visit this fund.

Big thanks to Nikita, Constantin, Anatoly, Oleksandr, Dima, Pavel B, Pavel, Robert, Roman, Iyri, Andrey, Lidia, Vladimir, August, Roman, Egor, Roman, Evgeniy, Nadia, Daria, Dzmitry, Mikhail, Nikita, Dmytro, Denis and Mikhail for supporting the newsletter. They receive early access to the articles, influence the content and participate in the closed group where we discuss the architecture problems. Join them at Patreon or Boosty!