A single sloppy software failure can cost an organisation a great deal of money, but much of that cost can be avoided with modest changes to architecture and design. This new edition of ‘Release It!’ was released in January 2018. It shows how to design systems that run longer with fewer failures and recover control when things go wrong. This book is a must-have guide to engineering production systems.
HOW THIS BOOK HELPED US
This book helped us avoid risks that cost companies dearly in downtime and reputation. Eighty per cent of the product life-cycle cost is in production.
THE BOOK EXPLAINED IN UNDER 60 SECONDS
The updated edition of Release It! Design and Deploy Production-Ready Software discusses running modern production systems: large, complex and heavily virtualised. It includes insight on chaos engineering, the discipline of implementing zero-downtime updates and continuous delivery, and making cloud-native software robust. It analyses approaches to architecting, designing and building software, primarily distributed systems.
TOP THREE QUOTES
- “Software design as taught today is terribly incomplete. It talks only about what systems should do. It does not address the converse—things systems should not do. They should not crash, hang, lose data, violate privacy, lose money, destroy your company or kill your customers.”
- “Most testers I’ve known are perverse enough that if you tell them the “happy path” through the application, that’s the last thing they’ll do. It should be the same with load testing.”
- “Design with scepticism, and you will achieve resilience.”
BOOK SUMMARIES AND NOTES
Chapter One: Create Stability
When your software or database crashes, schedule a change window and try a different approach. You can get a local engineer to help you implement the change. If the database itself is the problem, consider moving to a better service provider. To update a database, start by moving your information and data from the old database (database 1) to the new one (database 2), and then update database 1. After verifying the updated database, move your information and data back to database 1.
Poor stability carries real costs. The most obvious is lost revenue. Good stability doesn’t cost that much. When building the architecture, design and even the low-level implementation of a system, many decision points have high leverage over the system’s stability. Highly stable software usually costs the same to implement as unstable software. To discuss stability, you have to be aware of transactions. A transaction is an abstract unit of work processed by the system. A transaction can span many pages and usually involves external integrations such as credit card verification. Transactions are the reason a system exists. A system that processes just one transaction type is a dedicated system.
The main threats to your system’s longevity are data growth and memory leaks. Both can kill your system in production and are rarely caught during testing. Testing makes problems visible so you’re able to fix them. The issues that arise once your software is released are the ones you didn’t test for. So if you do not test for out-of-memory errors in the application, those crashes will occur in production.
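One way to make a memory leak visible before production is to measure how much memory a code path still holds after many calls. The following is a minimal sketch using Python's standard `tracemalloc` module. The `handle_request` function and its unbounded `_cache` are hypothetical stand-ins for a leaky handler, not something taken from the book.

```python
import tracemalloc

_cache = {}  # hypothetical unbounded cache: entries are never evicted


def handle_request(request_id: int) -> str:
    """Hypothetical handler that leaks by caching every response forever."""
    response = f"response-{request_id}" * 100
    _cache[request_id] = response
    return response


def memory_growth(func, iterations: int) -> int:
    """Return the bytes of traced memory still held after calling func repeatedly."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    for i in range(iterations):
        func(i)
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before


if __name__ == "__main__":
    grown = memory_growth(handle_request, 10_000)
    print(f"memory still held after 10,000 requests: {grown} bytes")
```

A test suite could fail the build when the growth exceeds a budget chosen for the application. The right threshold depends on the workload and is left as an assumption here.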
Favourite quote of the chapter: “Design with scepticism, and you will achieve resilience.”
Chapter Two: Design for Production
Designing for production means taking operations concerns into account, starting with the physical fundamentals of the system: the machines and wires that everything rests on. That includes resolving issues around networks, hostnames and IP addresses. Every layer of a deployment has its own set of concerns that the software design must account for:
- Operations—Security, capacity, status, communication, availability
- Control Plane—Deployment, anomaly detection, features, system monitoring
- Interconnect—Routing, failover, traffic management, load balancing
- Instances—Services, components, processes, instance monitoring
- Foundation—Hardware, VMs, physical network, IP addresses
Networking within the data centre and the cloud takes more than opening a socket. These networks usually have more redundancy and security than desktop networks. Once you add a layer or two of virtualisation, applications and services behave differently from how they do in the safe confines of the IDE. They need extra work to behave correctly in this environment.
An instance is an installation on a single machine (virtual or physical) out of a load-balanced array of the same executable. Individual instances should provide transparency, handle configuration properly, accept control and manage connections. Every machine requires the correct code, configuration and network links. Developers usually pay most attention to the behaviour of their code, which is why they have great tools to build, store and deploy code. Developers should be able to build the system, run tests and run at least a portion of the system locally.
The interconnect layer covers all the mechanisms that combine a group of instances into a cohesive system, including traffic management, discovery and load balancing. It’s through the interconnect layer that high availability can be created. Consider the right solution for your organisation as you move up the stack to interconnect, control plane and operations. Some service discovery and invocation techniques depend on additional pieces of software. A large team with thousands of small services is well served by Consul or another dynamic discovery service, and the cost of operating Consul is quickly repaid. For small teams with a slow-changing infrastructure, the ideal choice is DNS. That means dedicated physical machines and long-lived virtual machines, whose IP addresses remain stable enough for DNS to be convenient.
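To illustrate the DNS option, here is a minimal sketch of resolving a service name to its current addresses with Python's standard `socket` module. The `discover` helper is hypothetical, and `localhost` stands in for an internal service name that a real deployment would define.

```python
import socket


def discover(hostname: str, port: int) -> list[str]:
    """Resolve a service name to its current IP addresses via the system resolver."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # "localhost" is a placeholder for a hypothetical internal service name.
    print(discover("localhost", 8080))
```

A caller would pick one of the returned addresses and re-resolve periodically. That only works well when addresses change slowly, which is the situation described above.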
Favourite quote of the chapter: “Load balancing, routing, load shedding and service discovery are some of the key issues to consider when building layers.”
Chapter Three: Deliver Your System
You shouldn’t plan for only one or a few deployments to production but for many. In the traditional model, you write, zip and send your software to operations for deployment, along with release notes about every new configuration option they should set. Operations then schedules some “planned downtime” to execute the release. Most of the time, you design for the state of the system after a release. The problem is that this assumes the entire system can be changed in one instantaneous quantum jump.
Code is a pure liability between the time you commit it to the repository and the time it runs in production. Undeployed code is unfinished inventory. It may contain unknown bugs or cause production downtime. It could be an ideal implementation of a feature that no one wants. Continuous deployment minimises the delay between committing code and running it in production, and with it the liability of undeployed code.
Databases are the main reason behind “planned downtime”, mostly because of schema changes to relational databases. Instead of running raw SQL scripts against an admin CLI, you can have programmatic control to roll your schema version forward. A migration framework such as Liquibase can help you apply changes to the schema. However, it does not automatically make those changes forwards and backwards compatible.
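The idea of rolling a schema version forward under programmatic control can be sketched with Python's built-in `sqlite3` module. This is a toy illustration, not how Liquibase works: the `MIGRATIONS` list, the `customer` table and the use of SQLite's `user_version` pragma as a version counter are all assumptions made for the example. The second migration only adds a nullable column, so code written against version 1 keeps working.

```python
import sqlite3

# Ordered, hypothetical migrations. Each one only adds to the schema,
# so code written against the previous version keeps working.
MIGRATIONS = [
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
    "ALTER TABLE customer ADD COLUMN email TEXT",  # nullable: old inserts still succeed
]


def migrate(conn: sqlite3.Connection) -> int:
    """Apply any migrations not yet recorded and return the schema version."""
    version = conn.execute("PRAGMA user_version").fetchone()[0]
    for number, statement in enumerate(MIGRATIONS, start=1):
        if number > version:
            conn.execute(statement)
            conn.execute(f"PRAGMA user_version = {number}")
            conn.commit()
    return conn.execute("PRAGMA user_version").fetchone()[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print("schema version:", migrate(conn))
    # An "old" writer that knows nothing about the email column still works.
    conn.execute("INSERT INTO customer (name) VALUES (?)", ("Ada",))
    print(conn.execute("SELECT id, name, email FROM customer").fetchall())
```

Running `migrate` twice is harmless because already-applied steps are skipped. Whether a given change is compatible with old and new code is still a design decision the tool cannot make for you.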
When adding features to your application, be careful not to break consuming applications. Different consumers of your service have different objectives and needs. Every consuming application has its own development team that operates on its own schedule. You can’t force consumers to match your release schedule. To make compatible API changes, consider what makes for an incompatible change.
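As a rough illustration of thinking about incompatible changes, the sketch below compares two versions of an API description and flags two common kinds of breakage: a request field that becomes required, and a response field that disappears. The `breaking_changes` function, the field names and the choice of these two rules are assumptions for the example, not an exhaustive list from the book.

```python
def breaking_changes(old_request: dict[str, bool], new_request: dict[str, bool],
                     old_response: set[str], new_response: set[str]) -> list[str]:
    """List changes that could break an existing consumer.

    Request specs map field name -> True if the field is required.
    Response specs are the set of field names the service returns.
    """
    problems = []
    for field, required in new_request.items():
        if required and not old_request.get(field, False):
            # Existing consumers may not send this field at all.
            problems.append(f"request field '{field}' is newly required")
    for field in sorted(old_response - new_response):
        # Existing consumers may read this field from the response.
        problems.append(f"response field '{field}' was removed")
    return problems


if __name__ == "__main__":
    old_request = {"sku": True, "coupon": False}
    new_request = {"sku": True, "coupon": True}
    for problem in breaking_changes(old_request, new_request, {"id", "total"}, {"id"}):
        print(problem)
```

Adding an optional request field or an extra response field passes this check, which matches the intuition that consumers on their own schedule should be able to ignore additions.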
Favourite quote of the chapter: “At the same time, I had a deep sense of loss: all that time in the deployment army. All that wasted potential. The wasted humanity! Using people as if they were bots. Disrupting lives, families, sleep patterns..it was all such a waste.”
Chapter Four: Solve Systemic Problems
Load testing is usually a hands-off process. You specify a test plan, generate some scripts, configure the load generators and test dispatcher, and set off a test run overnight. After the test is done, you analyse the data collected during the run, examine the results, make configuration changes and schedule the next test run. Load testing is both an art and a science. It’s impossible to replicate real production traffic exactly, so use traffic analysis and intuition to get as close to a realistic simulation as possible.
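The generate-load-then-analyse loop can be sketched in a few lines with Python's standard library. This is a toy under stated assumptions: `handle` is a hypothetical stand-in for the system under test (a real run would call the service over the network), and the request count and concurrency are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles


def handle(request_id: int) -> int:
    """Hypothetical system under test; a real run would call the service."""
    time.sleep(0.001)  # simulated work
    return request_id


def timed_call(request_id: int) -> float:
    """Return the latency of one request in seconds."""
    start = time.perf_counter()
    handle(request_id)
    return time.perf_counter() - start


def run_load(requests: int, concurrency: int) -> dict[str, float]:
    """Fire `requests` calls from `concurrency` worker threads and summarise latency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(requests)))
    cuts = quantiles(latencies, n=100)  # 99 cut points: cuts[49] is p50, cuts[94] is p95
    return {"count": len(latencies), "p50": cuts[49], "p95": cuts[94], "max": max(latencies)}


if __name__ == "__main__":
    print(run_load(requests=200, concurrency=20))
```

Uniform synthetic requests like these are exactly the "happy path" the book warns about. A more realistic plan would mix request types and arrival patterns taken from traffic analysis.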
Change is guaranteed, but survival is not. Agile development embraces change in reaction to business conditions. However, the arrow can also point in the other direction: software change can generate new products and markets. It can open up space for new alliances and new competition, creating surface area between businesses that used to be in different industries. Not all software needs to change daily. Some pieces of software have no potential for rapid change and adaptation. In some industries, software change goes through expensive and time-consuming certification. If delivering a change means sending astronauts into space with a screwdriver and a chip puller, the transaction cost is severe.
Your company has to go through a decision cycle to implement a change. Someone has to sense that a need exists. Someone has to decide that a feature will fit that need and that it’s worth implementing. Then someone has to act: design the feature and put it on the market. Finally, someone has to see whether the change had the expected effect. In small companies, the process may involve two or three people, so communication is pretty fast.
Chaos engineering deals with distributed systems, usually large-scale ones. Staging or QA environments are poor guides to the large-scale behaviour of production systems. Different ratios of instances produce qualitatively different behaviour, and the same is true of traffic: congested networks behave in a qualitatively different way from uncongested ones.
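Chaos engineering proper is practised against real, large-scale systems, but the core move of injecting a failure and checking that the caller copes can be shown in-process. The sketch below is only a toy: `with_chaos`, `InjectedFault`, `fetch_price` and `price_or_fallback` are hypothetical names invented for the example.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real dependency failure."""


def with_chaos(func, failure_rate: float, rng: random.Random):
    """Wrap func so that a fraction of calls fail with InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure calling {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_price(sku: str) -> float:
    """Hypothetical downstream call."""
    return 9.99


def price_or_fallback(fetch, sku: str, fallback: float = 0.0) -> float:
    """Caller under test: degrade to a fallback instead of crashing."""
    try:
        return fetch(sku)
    except InjectedFault:
        return fallback


if __name__ == "__main__":
    flaky = with_chaos(fetch_price, failure_rate=0.3, rng=random.Random(42))
    results = [price_or_fallback(flaky, "sku-1") for _ in range(1000)]
    print("fallbacks served:", results.count(0.0))
```

Passing in a seeded `random.Random` keeps an experiment repeatable. As the paragraph above notes, a small experiment like this says little about behaviour at production scale.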
Favourite quote of the chapter: “Most testers I’ve known are perverse enough that if you tell them the “happy path” through the application, that’s the last thing they’ll do. It should be the same with load testing.”
HOW THIS BOOK CAN HELP SOFTWARE DEVELOPERS
“Release It!” by Michael T. Nygard is a practical guide that helps software developers design, develop, and deploy production-ready software. It provides real-world examples and case studies, highlighting common pitfalls and offering solutions to avoid them. It offers tips and techniques to identify and fix issues before deployment, including managing dependencies, testing, and monitoring. It covers topics such as performance optimisation, scalability, fault tolerance, monitoring, and logging, focusing on building software that can survive the rigours of production. By following the book’s guidelines, software developers can create reliable and resilient systems that meet the demands of their users and customers. Overall, this book is an essential resource for developers looking to improve their software’s quality and performance in the production environment.