How to build quality software products

October 10, 2022 by Christopher Sherman

Building quality software requires blending business, engineering, and artistic expertise to create a product. Beautifully-written code is not enough. We need a process for organizing the team, identifying the job to be done, and designing, implementing, and maintaining the product.

Having created, extended, and maintained several software products over the past decade, I have identified some common characteristics required to deliver quality software. In the sections that follow, I provide an overview of each, along with a checklist of the minimum set of actions required to be successful. As you embark on your software project, I hope this reference will put you on the right path and provide a helpful list of action items to refer to along the way.

Project management

Project management is the process of planning, executing, and reviewing progress towards a stated goal. We undertake a project to benefit stakeholders, the people whose interests a project serves. To complete a project, we require a team of people with the skills and expertise necessary to achieve the project goal.

Embarking on a project demands resources. These resources may be personnel, such as user experience designers and software engineers; property, such as hardware and software; or intangible goods, such as time and influence. Most projects have a resource budget, which leads us to economize. The project management process aims to use resources effectively to finish a project within its budget.

Projects are typically led by a project manager. The project manager has responsibility to (Cartar 2019):

  • Interview stakeholders to understand their goals
  • Precisely define scope
  • Identify and sequence activities
  • Determine resource needs
  • Estimate costs
    • Time needed to complete activities
    • Dollars needed to complete activities in the estimated time
  • Create and manage budgets
  • Acquire resources
  • Set and manage the schedule
  • Analyze and manage risks
  • Form and lead teams

Planning a project up front and executing the plan is one way to approach a project, but there are other valid strategies. A project management methodology offers strategies and procedures for project planning and execution. By adopting a methodology, a project team becomes more autonomous because the methodology expresses the project manager’s intent without requiring specific directives. This frees the project manager from overseeing every detail of the project, thereby removing a bottleneck to progress.

The chosen methodology should align with the project’s scope, complexity, and team. For small, simple projects, planning up front and executing may be all that is needed. For large, complex projects, it may not be possible to define everything ahead of time, so an iterative methodology that allows for adjustments is more appropriate. The experience, habits, and values of the team are critical considerations when selecting a methodology. People, not methodologies, carry out the project plan, so the team must understand and buy into the chosen methodology for the project to be successful.

While the project manager is responsible for a project’s success, large projects require delegating responsibilities across a team. To facilitate delegation, teams often employ software tools. These tools document, delegate, measure, and report on the many aspects of a project. They often give each team member a perspective of the project appropriate to his role. When choosing tools, it is important to select ones that fit the team’s methodology.

Checklist

  • Select a project management methodology appropriate to the project’s expected scope, complexity, and team
  • Identify tools for documenting, delegating, measuring, and reporting on project management responsibilities. Tools should align with the project management methodology

User experience (UX) design

User experience (UX) design seeks to meet users’ needs in the context in which they use a product (Interaction Design 2022). These contexts include onboarding, troubleshooting, and offboarding, in addition to core functionality. While good UX is in the eye of the beholder, there are universally recognized aspects of good design which the UX design process helps incorporate into the products we build.

Durability, usefulness, and aesthetics

Well-designed products offer durability, usefulness, and aesthetics—qualities Roman architect Vitruvius Pollio identified more than 2,000 years ago (The University of Chicago Library 2011). In the context of software, durability equates to scalability, security, error handling, and alerting. Usefulness conveys itself in the form of efficient interfaces and intuitive features that allow the user to accomplish something. Aesthetics impart style, proportion, and visual beauty on the product, giving it an emotional appeal or distinctive identity in users’ minds. To design these qualities into a product, we must determine a product’s why, what, and how.

Why, what, and how

The why consists of users’ motivations for adopting a product (Interaction Design 2022). This starts with a job the user needs done. Typically, the marketplace has multiple products capable of accomplishing the same job. Which product a user chooses is based on costs relative to benefits. Costs include money and time spent learning and using a product. Benefits include accomplishing the job to be done, time saved, and how the values associated with using a product reflect on the user.

The what consists of a product’s functionality (Interaction Design 2022). To appeal to users, a product must satisfy a job its users need done. The job could be physical (such as digging a hole or satiating hunger) or cognitive (such as analyzing data or providing entertainment). Some products accomplish multiple jobs simultaneously. For instance, Google’s reCAPTCHA software protects websites from bots while also training artificial intelligence models to annotate images and digitize text.

The how is the way a product goes about accomplishing the job to be done. This consists of a product’s aesthetics and whether it is accessible to its target users (Interaction Design 2022). Intertwined with the why and what, the how differentiates products from one another.

To light a room, we can use a standard fluorescent light fixture: it accomplishes the why—seeing more clearly—and the what—artificially lighting a room. However, how fluorescent lighting goes about its job is generally not considered aesthetically pleasing. LED light fixtures, an alternative to fluorescent fixtures, accomplish the same why and what, but are flexibly shaped, capable of producing a more natural and dimmable light, and can tolerate freezing temperatures. These aspects make LEDs more aesthetically pleasing and accessible in a wider variety of environments, differentiating them from fluorescent lighting.

To economize resources, it may be tempting to give short shrift to the how of user experience. Satisfying functional needs is critical, but ignoring aesthetics can diminish the value a product delivers. Business software is often more utilitarian than digital consumer products. But for both types of software, it is people who use them, and how people go about their work affects the perceived value of the work they do. If a business gives its employees clumsy software, employees may interpret their work as unimportant because the business did not invest in quality tools.

User stories

Between the why and how, someone must define what the product does. In agile project management methodologies, it is common to interview users to develop user stories: a few sentences outlining a particular task the user wants to accomplish (Rehkopf 2022). The project manager uses input from stakeholders to prioritize stories, and the development team turns the highest-priority stories into tasks for implementation.

User stories ensure the product solves an actual problem for a real user. Without defining user stories, it is surprisingly easy to expend significant effort on features that do not have a definitive beneficiary.

Atlassian, a developer of project management software, recommends the following process for writing user stories (Rehkopf 2022):

  • Interview users and listen to feedback. Talking to users is paramount. It helps capture the job to be done and avoid faulty assumptions. Users should not necessarily dictate how the product works, but we must ensure there is demand for the product. Once we start product development, we should get user feedback as early as possible to avoid traveling too far down an unworkable path.
  • Define personas. Who are the people in the story? Creating characters with names and titles brings the story to life, ensuring there is an actual person who will benefit from implementing the job the story describes.
  • Define “done”. A story is “done” when a user can undertake the job the story describes. It is important to define exactly what done means.
  • Outline substories. For larger stories, it makes sense to break them down into substories. This helps the team break off manageable chunks to iteratively deliver progress.
  • Order steps. Write a story for each step in a larger process. If there are multiple personas, we may want to write stories from each character’s perspective.

To get started writing user stories, it helps to have a template. Atlassian (Rehkopf 2022) recommends a simple sentence structure: “As a [persona], I [want to], [so that].” Breaking this template down:

  • “As a [persona]”: Who are we solving a problem for? We are after more than a job title; we want a persona representing a person: Mike. The team should have a shared understanding of who Mike is and how he works, thinks, and feels. Ideally the team has interviewed multiple Mikes.
  • “I [want to]”: What is Mike trying to accomplish at the juncture the story describes? This statement should describe intent rather than implementation details.
  • “[so that].”: What is the overall benefit Mike is trying to achieve or problem he is trying to solve? How does the immediate task the story describes fit into the bigger picture?

On some projects, the project manager interviews stakeholders and writes user stories; on others it is the UX team; on still others, it is business analysts who work directly with users and communicate between the project management and UX teams. Regardless of team structure, defining user stories is essential for establishing the inputs of why, what, and how required to deliver good UX.

Documentation tooling

We have compiled quite a bit of information thus far: user stories; why, what, and how; characteristics that make our product durable, useful, and aesthetically pleasing. To capture this information and enable collaboration, we need a place to store it. For the smallest projects, we can use text files and spreadsheets. For products of any significant size and complexity, we need a documentation tool. This tool should offer versioning, real-time collaboration, notifications, comments, search, tables, charts, and templates for organizing information. Such a tool enables us to inform the team, make decisions, and remember why we made certain decisions.

Design system

Having documented the motivation behind our product, the obvious next step is creating a prototype. However, before thinking about implementation, we should establish a design system: clear standards that guide the creation and usage of modular components (Fanguy 2019). A design system is about more than colors and font sizes—it explains why and how to use each component and ensures different components fit together with a shared look and feel.

Think of a design system as a physical toolbox: when we encounter a job to be done, we go to our toolbox to select one or more tools to tackle the job. Each tool comes with a set of instructions. The instructions explain why each tool exists and how to use it. They may include examples of using a tool in combination with other tools. Just as manufacturer Snap-on created a set of modular ratchets with the slogan “Five to do the work of Fifty” (Snap-on 2022), the standards of our design system should lead to modular components we can combine for use in a variety of scenarios.

At first, our toolbox may be sparsely outfitted—we may even have to use the blunt side of a screwdriver as a hammer. Over time, we can add more and specialized tools, allowing us to accomplish jobs more effectively. This, in turn, enables us to deliver better user experience.

When it comes to adding tools to our toolbox, it is not necessary to build everything ourselves. One way to accelerate initial product development is adopting an existing design system. Chances are, there are some challenges of building our product others have already faced. Mature design systems and their associated component libraries offer modular components for solving commonly faced problems.

Whether we leverage an existing design system or create one from scratch, we must capture its design standards in our documentation tool. For individual components, screenshots of each component adjacent to descriptions of why the component exists and how to use it will get the job done. However, it is best to use an interactive tool to bring together live components, examples, and documentation in one place. Storybook allows anyone on the team to visualize and interact with the entire design system toolbox, complete with instructions alongside each tool.

A design system not only encourages a unified look and feel, it also brings predictability to prototyping and development efforts. Because we outfit our toolbox up front, prototyping consists of modularly combining existing components rather than imagining everything from a blank canvas. To turn prototypes into code, developers reach for pre-built components, increasing the certainty with which they estimate implementation time. We should have thoroughly tested these components when adding them to the design system, so quality assurance and bug fixing cycles should become shorter as well.

Prototyping

With a design system in place and user stories in hand, it is time to prototype the product. In the context of user interfaces, it is common to create wireframes: a set of blueprints that help the team think and communicate about the structure of the software it is building (Guilizzoni 2022). Whether for an entirely new product or a new feature of an existing product, we create wireframes before writing any code.

Historically, a distinguishing feature of wireframes was their low fidelity. Crudely sketching a view’s structure in black and white allows for quickly incorporating feedback during multiple wireframe iterations. Once the team arrives at a structure it likes, it adds details such as color, animations, and precise spacing. This process is known as visual design.

Assuming we have a design system in place, we collapse the wireframe and visual design processes into one step. Modern design platforms enable us to drag and drop copies of our design system components onto a prototyping canvas. We already defined the visual design of each component when establishing our design system, so there is no additional cost to adding the full-featured component to our prototype. Figma, a design platform, allows creating interactive prototypes, complete with click-through navigation, animations, and multi-layer overlays.

Seven factors of good UX

So far, we have established that UX is a multidisciplinary, qualitative process. To bring things together in a slightly more quantitative way, we refer to Peter Morville, author of several best-selling UX books. Morville identifies seven factors that equate to good UX (Interaction Design 2022):

  1. Useful: has a purpose for its users, i.e., the job to be done
  2. Usable: enables users to achieve their objectives effectively and efficiently
  3. Findable: usefulness is easy to find
  4. Credible: information is accurate, and the product lives up to its promises
  5. Desirable: conveyed through branding and aesthetics
  6. Accessible: usable by the full range of the target userbase
  7. Valuable: sum of the other six factors

Assigning a weight to each of the first six factors and then rating a product against them, we arrive at a value score. By rating products according to the same factors, we have a means for comparing the UX of products inside and outside our organization.
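
As a rough illustration of the scoring described above, the following Python sketch computes a weighted average of the six rated factors. The weights, the 1 to 5 rating scale, and the two sample products are hypothetical; the seven factors themselves do not prescribe any particular numbers.

```python
# Hypothetical sketch: the factor weights and the 1-5 rating scale are
# assumptions made for illustration, not values taken from the article.
FACTORS = ("useful", "usable", "findable", "credible", "desirable", "accessible")


def value_score(weights: dict, ratings: dict) -> float:
    """Return the weighted average rating across the first six UX factors."""
    total_weight = sum(weights[factor] for factor in FACTORS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(weights[factor] * ratings[factor] for factor in FACTORS)
    return weighted_sum / total_weight


# In this made-up weighting, usefulness and usability count three times as much.
weights = {"useful": 3, "usable": 3, "findable": 1,
           "credible": 1, "desirable": 1, "accessible": 1}

# Made-up ratings for two products being compared.
product_a = {"useful": 5, "usable": 4, "findable": 3,
             "credible": 4, "desirable": 2, "accessible": 3}
product_b = {"useful": 3, "usable": 3, "findable": 5,
             "credible": 4, "desirable": 5, "accessible": 4}

scores = {"A": value_score(weights, product_a),
          "B": value_score(weights, product_b)}
best = max(scores, key=scores.get)
print(scores, "highest value score:", best)
```

Because both products are rated against the same factors and weights, the resulting scores are directly comparable. Changing the weights is how a team would express that, say, accessibility matters more for its audience.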

Checklist

  1. Documentation tool (e.g., Coda)
  2. Define the product’s why, what, and how
  3. What makes this product durable, useful, and aesthetically pleasing?
  4. Interview users to establish the job to be done
  5. Write user stories
  6. Revisit steps two and three
  7. Component library tool (e.g., Storybook)
  8. Establish design system
  9. Design platform (e.g., Figma)
  10. Create prototype
  11. Score the prototype according to the seven factors of good UX to determine whether the product is worth implementing

Integrated development environment

An integrated development environment (IDE) is a tool for writing and modifying source code. Commonly referred to as an editor, an IDE offers syntax highlighting, auto-completion, real-time validation, debugging, contribution history, automated refactoring, and conveniences for running terminal commands, among other efficiencies. My preferred IDE is Visual Studio Code (VS Code), a lightweight editor with extensions for enhancing its core functionality.

Checklist

  • Recommend an IDE (e.g., Visual Studio Code)
  • Document recommended IDE configuration and extensions

Source control

According to Amazon Web Services (2022):

Source control is the practice of tracking and maintaining changes to code. Source control management (SCM) systems provide a running history of code development and help to resolve conflicts when merging contributions from multiple sources.

An SCM system enables collaboration by providing tools to review and track changes to code. Git is a widely used SCM system in which developers pull down a complete copy of a repository onto their individual machines. Git refers to a copy of the code as a branch. One branch serves as the trunk, or master copy. From this branch, developers create additional copies of the code (i.e., branches) in which to make their changes. A collection of branches holding different variations of the same code is known as a repository. When a developer is ready to share his changes, he publishes his branch to a remote repository for review. Once the team approves the changes, the developer merges his changes into the trunk branch.

Trunk-based workflow

There are many branching strategies for developing, testing, and deploying code. My preference is the trunk-based workflow, where the trunk branch is ready to deploy at all times. To develop features or fix bugs, a developer creates a short-lived branch to which only he commits. Once the team approves the changes to the branch, the developer merges them to the trunk, at which point a continuous integration/continuous deployment (CI/CD) tool verifies the changes and deploys the new version of the product. The short-lived nature of branches reduces the potential for merge conflicts, and continuous deployment fits the software-as-a-service (SaaS) paradigm. Because some features cannot be developed within a day or two, we use strategies such as branch by abstraction or feature flags to enable us to merge incomplete features into the trunk branch, thereby keeping the life of branches short.
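
To make the feature-flag idea above concrete, here is a minimal Python sketch. The flag convention (an environment variable named FEATURE_<NAME>), the function names, and the dashboard example are all hypothetical; real projects often use a dedicated flag service instead.

```python
import os

# Hypothetical convention: a flag is on when FEATURE_<NAME> is 1, true, or on.
_TRUTHY = {"1", "true", "on"}


def is_enabled(flag: str, env=None) -> bool:
    """Return True when the flag's environment variable holds a truthy value."""
    env = os.environ if env is None else env
    return env.get(f"FEATURE_{flag.upper()}", "").strip().lower() in _TRUTHY


def render_dashboard(env=None) -> str:
    """Serve the unfinished dashboard only when its flag is switched on."""
    if is_enabled("new_dashboard", env):
        # Incomplete work lives in the trunk but stays dark by default.
        return "new dashboard (work in progress)"
    return "legacy dashboard"


print(render_dashboard(env={}))
print(render_dashboard(env={"FEATURE_NEW_DASHBOARD": "true"}))
```

With the flag off by default, the half-finished code path can be merged to the trunk and deployed without users seeing it, which is what keeps branches short-lived.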

Gitflow workflow

An alternative strategy is the Gitflow workflow. This workflow defines a strict branching model designed around project releases. The master branch stores the official release history, with each commit tagged with a version. A development branch serves as an integration branch for fully completed features. Feature branches exist until the feature is complete and interact only with the development branch.

When preparing for a release, we fork a release branch off the development branch and commit fixes to it until it becomes stable, at which point we merge the release branch into both the master and development branches. If we need a hotfix for a released version of the product, we create a hotfix branch from master, commit the fixes, and merge the hotfix into the master branch. If the fix also applies to the development branch, we merge the hotfix branch into the development branch.

Due to the strict nature and complexity of Gitflow, it comes with a CLI tool to help the team adhere to the workflow’s rules. The Gitflow strategy is useful for managing large projects supporting multiple versions of a product simultaneously. It is not typically used in continuous deployment scenarios due to long-lived feature branches and an extensive release process.

Checklist

  • Choose an SCM system for tracking source code
  • Choose a branching strategy and document it, including example scenarios

Logging

Logging events allows us to analyze errors, track performance, identify threats, and gain insights. When we identify a bug, logs allow us to recreate the state that led to the error. By measuring performance, we can identify when modifications to code, configuration, or infrastructure impact user experience. Similarly, recording user behavior enables us to prioritize our work, identify deviations that indicate malicious behavior, and improve marketing efforts.

The software industry commonly discusses logging in terms of monitoring, alerting, and observability. Monitoring consists of gathering predefined logs and metrics. Alerting notifies interested parties when the information the system monitors deviates from set thresholds. Observability combines monitoring and alerting, giving teams the ability to debug their systems.
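
The relationship between monitoring and alerting can be sketched in a few lines of Python. The metric names and threshold values below are hypothetical, and a real system would route alerts to a paging or chat tool rather than return strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """An upper limit for one monitored metric (names are hypothetical)."""
    metric: str
    maximum: float


def check_alerts(samples: dict, thresholds: list) -> list:
    """Compare monitored samples to thresholds and return alert messages."""
    alerts = []
    for threshold in thresholds:
        value = samples.get(threshold.metric)
        # A metric missing from the samples is skipped rather than alerted on.
        if value is not None and value > threshold.maximum:
            alerts.append(f"{threshold.metric}={value} exceeds {threshold.maximum}")
    return alerts


# Monitoring: predefined metrics gathered from the system.
samples = {"p95_latency_ms": 850.0, "error_rate": 0.01}

# Alerting: notify when a metric deviates from its set threshold.
thresholds = [Threshold("p95_latency_ms", 500.0), Threshold("error_rate", 0.05)]
print(check_alerts(samples, thresholds))
```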

Log analysis software provides sorting, filtering, and alerting capabilities to make sense of our logs. We aim to avoid monitoring unnecessary data because it obscures the information we want to uncover and is costly to sift through. That said, we do want logs of relevant data, and log analysis software enables logging potentially-relevant events without having to manually review a sea of log statements. This improves observability.

Each log statement should contain metadata in addition to its payload. The format should be human readable, developer friendly, and widely supported (e.g., JSON) to avoid vendor lock-in and facilitate forensic analysis in the event of a security breach. When possible, limit the payload to a single line (approximately 80-256 characters) to improve indexing, querying, and disk compression (Splunk 2022).

Metadata

  • Classification: error, warning, information, or debug
  • Timestamp: granular (e.g., microseconds) with a UTC offset
  • Unique identifier per transaction, allowing us to follow the transaction through multiple log statements
  • Source: service name, service revision number, filename, class or function, and line number

Payload

  • Text-based format (avoid binary files)
  • Break multi-line information into multiple statements, where possible
  • Should not contain sensitive data
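
The metadata and payload guidelines above can be sketched as a small Python helper that emits one JSON object per line. The field names, the source dictionary, and the sample values are assumptions for illustration; a team's style guide would define the actual schema.

```python
import json
import uuid
from datetime import datetime, timezone

# Classifications taken from the metadata list above.
LEVELS = {"error", "warning", "information", "debug"}


def format_log(level: str, message: str, transaction_id: str, source: dict) -> str:
    """Build a single-line JSON log statement with metadata and payload."""
    if level not in LEVELS:
        raise ValueError(f"unknown classification: {level}")
    record = {
        "level": level,
        # Microsecond granularity with an explicit UTC offset (+00:00).
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="microseconds"),
        # The same identifier appears on every statement for one transaction.
        "transaction_id": transaction_id,
        "source": source,
        "message": message,
    }
    # json.dumps escapes newlines, so the statement stays on one line.
    return json.dumps(record)


# Hypothetical source details; real code might derive these automatically.
source = {"service": "billing", "revision": "1.4.2",
          "file": "charge.py", "function": "charge_card", "line": 42}
transaction_id = str(uuid.uuid4())

print(format_log("information", "charge started", transaction_id, source))
print(format_log("error", "card declined by issuer", transaction_id, source))
```

Both statements share one transaction identifier, so log analysis software can filter on it to follow the transaction from start to failure. Note that the messages deliberately omit card numbers and other sensitive data.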

Deciding what to log and how to analyze it is the primary focus, but we also need a strategy for storing logs. Begin by writing logs to local storage. Local storage provides a buffer in the event of network latency or failure (Splunk 2022). From local storage, send logs to a service that stores them in a repository. Depending on our system design and log analysis software, we may have a centralized repository or a repository per service. Either way, we need strategies for handling writes during periods of peak throughput and for provisioning additional storage automatically to avoid running out of space. We also need plans for backing up and restoring logs. These processes ensure the information we log is available to analyze.

Checklist

  • Style guide defining log format, metadata, and payload
  • Analysis tool (e.g., Grafana Loki for logs, alongside Prometheus for metrics) for aggregating logs across services with the following capabilities:
    • Search, sort, and filter
    • Define anomaly thresholds and generate alerts
    • Reports
  • Automate storage management to avoid running out of space
  • Define expected peak throughput and determine throughput capacity
  • Configure repository redundancy/backup

Linting

Linting is a form of static code analysis that identifies syntax errors, style issues, and possible bugs. Most linters have a configuration file where we define which code we want to analyze and what kind of analysis we want to perform. We may run multiple linters on the same repository.

Each programming language has its own syntax, and linters alert us to problems before we attempt to compile or interpret the code. Software teams should define a style guide to ensure the code they write is consistent and understandable by the entire team. Linters relieve the team of having to police adherence to style guidelines by automatically detecting code that does not respect the established guidelines. Advanced linters can detect potential bugs, such as memory leaks and infinite loops. Some linters even automatically fix problems.
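
To show what static analysis means in practice, here is a toy linter written in Python using the standard library ast module. It enforces a single hypothetical style rule (function names must be snake_case) and is only a sketch; production linters implement many rules and read them from a configuration file.

```python
import ast
import re

# Hypothetical style rule: function names must be snake_case.
SNAKE_CASE = re.compile(r"^_*[a-z][a-z0-9_]*$")


def lint_function_names(source: str) -> list:
    """Report function names that break the rule, without running the code."""
    # ast.parse raises SyntaxError if the source is not valid Python,
    # which is how this sketch surfaces syntax errors.
    tree = ast.parse(source)
    problems = []
    for node in ast.walk(tree):
        is_function = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_function and not SNAKE_CASE.match(node.name):
            problems.append(
                f"line {node.lineno}: function '{node.name}' is not snake_case"
            )
    return sorted(problems)


sample = "def loadUser():\n    pass\n\ndef save_user():\n    pass\n"
for problem in lint_function_names(sample):
    print(problem)
```

The analyzed code is parsed but never executed, which is what distinguishes linting from testing.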

Checklist

  • Define a style guide for each programming language
  • Create configuration files for each linter
  • Configure your IDE to display linting errors inline
  • Configure each service to run linters as a Git pre-commit hook
  • Configure each service to run linters as part of the continuous integration pipeline

Testing

There is an axiom that code written without test coverage immediately becomes legacy code. Developers are loath to change legacy code for fear they overlook a hidden requirement, causing the application to stop behaving as expected. Tests document requirements and give us confidence our code works. With automated tests in place, we have the freedom to refactor and extend code and quickly receive feedback as to whether it still works after our changes.

Tooling is an integral part of test coverage. Test runners provide statistics, such as the percentage of lines covered with tests. Testing frameworks provide convenience functions that make writing tests easier. However, even if the tooling reports 100 percent test coverage, this does not necessarily mean we captured all requirements. There may be edge cases, such as extremely large values, that cause our code to break even if we have a test verifying the same line of code works with smaller values.
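
The gap between line coverage and captured requirements is easy to reproduce. In the Python sketch below, a hypothetical sigmoid function has a single test that executes every line, yet an input of very large magnitude still breaks it.

```python
import math


def sigmoid(x: float) -> float:
    """Naive logistic function; every line is exercised by the test below."""
    return 1.0 / (1.0 + math.exp(-x))


def test_sigmoid_midpoint():
    # This one test yields 100 percent line coverage of sigmoid.
    assert sigmoid(0.0) == 0.5


test_sigmoid_midpoint()

# The same line fails for an input of large magnitude, because
# math.exp(1000.0) is too large to represent as a float.
try:
    sigmoid(-1000.0)
except OverflowError as error:
    print("uncovered edge case:", error)
```

A coverage report would show nothing missing here, so the edge case has to come from thinking about the requirements rather than from the tooling.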

We should capture all requirements by writing automated tests at the time we write the code. This ensures the requirements are fresh in our minds when we capture them with test coverage. When deadlines approach, it is tempting to put off writing tests until later; however, this is exactly the time we most need test coverage. When stress is high and attention to detail short, mistakes multiply. Good discipline with test coverage helps keep mistakes under control, limiting the number of bugs developers write into the code.

There are three primary types of tests: unit, integration, and end-to-end. Unit tests are the smallest and fastest, verifying a particular unit of code works as expected. Integration tests combine two or more units of code, verifying we get the expected behavior when using them together. End-to-end tests bring up a full system and verify a given business process from start to finish.
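
The difference between the first two test types can be shown with a short Python sketch. The order-total functions are hypothetical, and the tests are written as plain functions with assertions in the style a runner such as pytest discovers; an end-to-end test is omitted because it would require a running system.

```python
def parse_quantity(text: str) -> int:
    """Unit 1: turn user input into a non-negative quantity."""
    value = int(text.strip())
    if value < 0:
        raise ValueError("quantity cannot be negative")
    return value


def line_total(unit_price_cents: int, quantity: int) -> int:
    """Unit 2: compute the price of one order line in cents."""
    return unit_price_cents * quantity


def order_total(unit_price_cents: int, quantity_text: str) -> int:
    """Combines both units, as application code would."""
    return line_total(unit_price_cents, parse_quantity(quantity_text))


# Unit tests: each verifies one unit in isolation.
def test_parse_quantity_strips_whitespace():
    assert parse_quantity(" 3 ") == 3


def test_line_total_multiplies():
    assert line_total(250, 4) == 1000


# Integration test: verifies the units behave as expected together.
def test_order_total_combines_units():
    assert order_total(250, " 4 ") == 1000
```

If either unit test fails, the integration test is expected to fail as well, which is one reason the bulk of coverage belongs at the unit level.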

Most test coverage should exist in the form of unit tests. If the units of code making up an integration or end-to-end test do not pass, these higher-level tests should not pass either. Because unit tests verify the smallest chunks of code, they are the fastest to run and simplest to maintain. Create integration and end-to-end tests judiciously, as they are less resilient to change and take more time to set up and run. Running tests gives us confidence our software is working, but we must balance the time it takes to write, maintain, and run tests against development and deployment velocity.

Tests are only valuable if we run them. Tooling should automatically run tests whenever a developer makes a commit, preventing the commit if the tests fail. Tests should also run whenever a developer pushes a commit to a remote branch; if any tests fail, the server tooling should notify the developer and mark the commit as failed. We might economize by running only unit tests when a developer makes a commit, while running integration and end-to-end tests, in addition to unit tests, when a developer merges to the trunk.

Checklist

  • Define the minimum level of acceptable test coverage
  • Which integration and end-to-end scenarios do we want to test?
  • Put tooling in place to report test coverage, enforce thresholds, and visualize missing coverage
  • Put tooling in place to automatically run tests and notify of failures

Debugging

When software does not work as expected, we need a way to quickly identify the cause of the problem (i.e., the bug). This process, known as debugging, may be as simple as logging function calls and state to the standard output log, or it may use tools to attach to the software process, allowing us to pause function execution to examine state at demarcated breakpoints. Debugging tools allow us to shorten the time between the identification of a bug and the time we have a fix ready to deploy.
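
The simplest approach mentioned above, logging function calls and state to standard output, can be sketched in Python with a decorator. The trace helper and the discount function are hypothetical names used only for illustration.

```python
import functools


def trace(func):
    """Print each call's arguments and return value to standard output."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(f"call {func.__name__} args={args} kwargs={kwargs}")
        result = func(*args, **kwargs)
        print(f"return {func.__name__} -> {result!r}")
        return result
    return wrapper


@trace
def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical function under investigation."""
    return price_cents - price_cents * percent // 100


apply_discount(1000, 15)
```

This kind of tracing is quick to add but requires editing and rerunning the code. Attaching a debugger avoids that by letting us pause at a breakpoint and inspect the same state interactively.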

VS Code has a JavaScript debugger for setting breakpoints, stepping through functions, and examining state. To configure it, we add an entry to the .vscode/launch.json file. This debugger even supports conditional breakpoints, which only activate if the state matches our condition.

Checklist

  • Configure the IDE debugger to debug running code (e.g., browser, CLI) and tests

Code review

Even the best developers make mistakes. Establishing a code review process enhances software quality by checking for things the author overlooked. This might be covering an edge case (even with 100 percent test coverage), pointing out a bug, or suggesting a better design.

Code review improves developers in addition to the software they write. Developers exchange ideas during code review, thereby leveraging the expertise of their teammates. This allows reviewers to pick up new techniques from reviewing others’ code and authors to receive feedback on areas to improve. Ideally, both parties will leverage what they learn to write better code in the future.

Code review is a form of feedback, and there are a few guidelines to keep in mind:

  • Remember there is a human being on the other end of the review. Even when there is significant room for improvement, deliver the message with respect.
  • Be clear. Make it easy for the author to understand what you are saying:
    • Avoid ambiguity (e.g., avoid using “it” or “that”)
    • State directly what is wrong and why
    • Suggest a path forward
  • Assume good intent. Not everyone will write code like you. Ask yourself whether the code is acceptable even though it may not be how you would do it.
  • Indicate nitpicks. If you do not agree with a particular choice but this should not keep the code from going to production, prefix your comment with “Nitpick”.
  • Include praise. When you come across particularly elegant or maintainable code, let the author know. That said, avoid obsequious comments.

The guidelines above also apply to the person receiving feedback. When others take time to review our code, we should assume they have our best interest in mind, as well as that of the product. While it may be frustrating to find our code needs work, we should diligently implement valid suggestions. When we disagree, we should do so respectfully, clearly explaining our thoughts.

Tooling is vital for code review. Tools indicate what changed, provide the ability to add comments on a line-by-line basis, and enable workflows for approving or denying changes. Many tools integrate with the build system, allowing us to see whether artifacts built successfully, tests passed, and test coverage thresholds were met. When the team approves changes, tooling can merge the code with the click of a button.

It may be tempting to skip code reviews when timelines get tight. However, this is when we most need multiple sets of eyes ensuring we maintain high code quality. Stress can affect attention to detail, leading to bugs and brittle code. Be sure to consistently adhere to the code review guidelines even when it feels like time is short. It costs far less to catch and fix a bug in code review than it does in production.

A picture is worth a thousand words, and code review is no exception. When reviewing changes to user interface code, being able to see a live deployment can be as helpful as reviewing the code itself. Using a continuous integration pipeline, we can automatically deploy a test instance each time a developer initiates a code review. For reviewing reusable components, tools like Storybook and its cloud companion Chromatic give us visual diff tools to see what changed. These tools allow less technical team members to participate in code review by providing feedback on changes before they make it into staging or production.

Checklist

  • Choose a tool for code diff, comments, and approval workflow
  • Integrate with the build system to indicate success/failure for building artifacts, tests, and test coverage
  • Establish guidelines for giving and receiving feedback in code review
  • Add an integration step to generate a test deployment for changes to user interfaces
  • Adopt tooling for visual regression testing (e.g., Storybook)

Third-party libraries

Third-party libraries accelerate development by providing high-quality features we do not need to build ourselves. They can also inhibit maintainability when library maintainers do not quickly address bugs or support new operating systems, browsers, and frameworks. Often, libraries come with parts we do not use, and the design of each library dictates whether we can keep unused code from weighing down our bundle size.

Checklist

  • Document the third-party libraries we depend on, with an explanation of why we need each one
  • Validate each library is well-maintained
  • Add configuration to remove parts of the library we do not utilize
  • Verify our libraries do not overlap in functionality
  • Verify component libraries fit our look and feel
  • Regularly validate libraries are up to date and free of known security vulnerabilities
  • Adopt a strategy for automatically updating libraries (e.g., GitHub Dependabot)

Continuous integration and continuous deployment

Continuous integration (CI) and continuous deployment (CD) are strategies for automating parts of the software development process. CI automates development processes while CD automates operations processes. Automation occurs in stages, with each stage operating on the output from a previous stage, similar to a shell command operating on output from a previous command via the pipe operator. For this reason, the industry refers to a collection of stages as the CI/CD pipeline.

Continuous integration (CI)

CI encourages developers to frequently merge code and alerts them to problems. When CI detects a new development branch, it automatically lints the code, runs tests, and generates a build artifact. Should anything fail, CI notifies the author of the issue, usually with a link to logs for debugging the problem. By automating the code integration process, CI gives us confidence our code does not break things.

Without CI, teams typically resort to a “merge day” strategy. On merge day, all developers working on a particular repository meet to merge their code, resolve merge conflicts, and test the resulting code. Since these meetings pause development, they tend to occur infrequently. This strategy increases the likelihood of conflicts and, by extension, mistakes resolving conflicts.

To eliminate merge days, we first define coding standards. CI enforces development standards through automation—without standards in place, there is nothing to enforce. Because CI enforces our standards, we allow developers to continually merge their changes after receiving code review approval. Merging more often reduces the probability of merge conflicts by reducing the time between developers pulling the latest code and merging in their changes.

For CI to work, we must configure linting, write tests, and configure builds to run directly from the command line. In addition, each stage must run quickly so as not to deter developers from frequently pushing their changes. Having met these requirements, we leverage CI tools to establish stages in the integration pipeline, monitor for changes, collect logs, and report the success or failure of each stage.
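
To illustrate why each stage needs to run from the command line, here is a minimal sketch of a pipeline runner. The stage names and commands are placeholders I chose for illustration, not the configuration of any particular CI product; real CI tools add change detection, log storage, and notifications on top of this idea.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class StageResult:
    name: str
    succeeded: bool
    log: str


def run_pipeline(stages):
    """Run each (name, command) stage in order and stop at the first failure."""
    results = []
    for name, command in stages:
        completed = subprocess.run(command, capture_output=True, text=True)
        succeeded = completed.returncode == 0
        # Keep the combined output so the author can debug a failed stage.
        results.append(StageResult(name, succeeded, completed.stdout + completed.stderr))
        if not succeeded:
            break  # Later stages depend on this one, so do not run them.
    return results


if __name__ == "__main__":
    # Hypothetical stages: substitute the project's real lint, test, and build commands.
    example_stages = [
        ("lint", [sys.executable, "-c", "print('lint ok')"]),
        ("test", [sys.executable, "-c", "print('tests ok')"]),
        ("build", [sys.executable, "-c", "print('build ok')"]),
    ]
    for result in run_pipeline(example_stages):
        print(result.name, "passed" if result.succeeded else "failed")
```

Because the runner only relies on exit codes, any linter, test runner, or build tool that follows the usual convention of a non-zero exit code on failure can slot in as a stage.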

Continuous deployment (CD)

CD automates the release of integrated code. CI is a prerequisite to CD, because we only want to deploy validated code. The CD process starts as soon as a new commit hits the trunk or master branch. By automating the deployment process, we avoid the operations team becoming a bottleneck in the release process.

The first stage in CD is initializing the execution environment. We define the environment in code using Docker images. By defining the execution environment in code (e.g., Dockerfiles), we eliminate the one-by-one configuration of physical environments. Instead, the Docker runtime stamps out instances of the execution environment via Docker images. An instance of a Docker image is known as a Docker container. Docker containers isolate the execution environment from the host machine on which our environment runs, increasing predictability and opportunities for automation.

The amount of orchestration required to deploy code differs between libraries and services/applications. For libraries, the deployment process may involve just two stages: initializing the execution environment and running the publish command. For services, we need to replace an existing release with a new one in an automated fashion. Orchestrating this update requires additional tooling.

Kubernetes (K8s) is an orchestration platform for deploying, scaling, and managing Docker containers (as well as other container technologies). I will focus on automating updates, but K8s has numerous other features for orchestrating containerized workloads.

K8s offers load balancing, giving us the ability to run multiple instances of our services. When it comes time to release an update, we configure K8s to terminate one instance of the service we want to update while it spins up a new instance holding our updated code. Once K8s detects the new instance is ready, it notifies the load balancer to direct traffic to it. K8s repeats this process until every instance in the workload is running the new version. In K8s parlance, each instance is a pod, which wraps one or more containers.

The update process described above glosses over numerous complexities. Whether the service holds state and whether the updated service API is forward and backward compatible determines the extent to which we can automate the update without disrupting users. The update process I described is known as a rolling update strategy. There are alternative strategies, such as blue/green deployment, that allow K8s to switch from one version to another all at once. Which strategy is most effective depends on the nature of each service.
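
The difference between the two strategies can be sketched with a small simulation. This is a toy model under my own assumptions (a service is a list of version strings, and readiness checks are skipped); it does not call any K8s API.

```python
def rolling_update(replicas, new_version):
    """Replace replicas one at a time, yielding the fleet state after each step.

    `replicas` is a list of version strings, one per running instance.
    """
    fleet = list(replicas)
    for index in range(len(fleet)):
        # Take one old instance out of rotation and bring up its replacement.
        # A real orchestrator would wait for a readiness signal at this point.
        fleet[index] = new_version
        yield list(fleet)


def blue_green_update(replicas, new_version):
    """Stand up a complete new fleet, then switch all traffic to it at once."""
    return [new_version] * len(replicas)


if __name__ == "__main__":
    for state in rolling_update(["v1", "v1", "v1"], "v2"):
        print(state)
```

Every intermediate state of the rolling update except the last contains both versions, which is why forward and backward compatibility matter for that strategy. The blue/green function never exposes a mixed fleet, at the cost of briefly running twice the capacity.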

Instead of deploying directly to production, it is common to first deploy to testing and staging environments. In the testing environment, stakeholders test the new version, indicating whether they accept or decline the changes. We already ran automated tests during the CI process, so this testing is a manual process. Once stakeholders accept the changes, we configure the CD pipeline to deploy to a staging environment, giving the team one last chance to verify the service is ready for production. In sophisticated CD pipelines, the approval process is part of the pipeline, allowing us to automate the deployment of the new version to staging and production as soon as stakeholders sign off.

As a final step in the CD pipeline, some teams run smoke tests: basic checks on key functionality of the updated service. The motivation behind smoke tests is to verify a new deployment is working as expected and quickly roll back if the tests fail. Rollback is yet another feature K8s can orchestrate automatically.
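
As a rough sketch of that final step, the function below deploys, runs a set of smoke tests, and rolls back on the first failure. The `deploy`, `rollback`, and check callables are hypothetical stand-ins for whatever the pipeline and orchestrator actually provide.

```python
from typing import Callable, Dict


def deploy_with_smoke_tests(
    deploy: Callable[[], None],
    rollback: Callable[[], None],
    smoke_tests: Dict[str, Callable[[], bool]],
) -> bool:
    """Deploy, run each smoke test, and roll back if any test fails.

    Returns True when the new release stays in place.
    """
    deploy()
    for name, check in smoke_tests.items():
        try:
            passed = check()
        except Exception:
            passed = False  # A crashing check counts as a failure.
        if not passed:
            print(f"smoke test failed: {name}")
            rollback()
            return False
    return True
```

Keeping the checks few and fast matters here: the longer the smoke tests run, the longer a broken release stays in front of users before the rollback starts.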

Checklist

  • Define coding standards (e.g., linting rules, test coverage levels)
  • Adopt tooling to enforce coding standards
  • Select CI/CD tooling
  • Select container and orchestration technologies
  • Define CI/CD stages

Security

Exposing web services over a network disseminates their benefits to our user base, but it also exposes us to attack. To keep our services secure, we need to develop a security strategy for mitigating commonly exploited attack vectors.

The security landscape is constantly evolving, so I will not attempt to document each attack vector here. As a whole, the engineering team should familiarize itself with the Open Web Application Security Project (OWASP) Top 10 security risks. OWASP maintains descriptions, example scenarios, and mitigation measures for the most critical security risks to web services. Within the engineering team, we should dedicate specific personnel to develop and adapt our security strategy. This security team should audit our services against risks on a scheduled interval and make recommendations for addressing any security holes it identifies.

In addition to in-house security audits, teams should consider periodic third-party security audits. Professional security auditors see a variety of environments and should be familiar with both common and exotic vulnerabilities. Before the audit, we need to decide what level of testing we want the auditor to perform.

  • Black-box testing occurs from outside the internal network without any special knowledge of the system.
  • Gray-box testing gives the auditor more information, usually in the form of system architecture details and elevated permissions. In gray-box testing, the auditor has access to the internal network (i.e., operates inside the firewall).
  • White-box testing gives the auditor full access to source code, documentation, and the network.

By leveraging a third-party auditor, we bring a fresh set of eyes and additional expertise.

A significant way to enhance the security of web services is adopting a third-party identity service (i.e., authentication and authorization service). While outsourcing this critical aspect of security is a risk, there are several established vendors whose services undergo regular security audits to ensure their products are secure against known exploits. Most authentication and authorization protocols have numerous steps and rules that require strict implementation. Relying on a vendor who specializes in these protocols is a better option than enlisting a few engineers who dabble in it. While I generally avoid recommending third-party services that create vendor lock-in, the cost of a security breach justifies the tradeoff.

A key practice for enhancing security is keeping infrastructure and software dependencies patched with the latest updates. This basic yet powerful step protects our services from known vulnerabilities. The work involved with updating versions may not provide benefits in terms of user-facing features, so teams often put off this effort until a critical vulnerability arises, at which point the effort to update may be quite large. To avoid this situation, we should establish a regular cadence for patching dependencies. We also need to monitor for patches of critical vulnerabilities that require immediate attention.

Software vendors typically establish end-of-life dates for their products, after which they will no longer provide security patches. We need a plan for ensuring we migrate to a supported version of each product before reaching its end-of-life date. Having the latest patch for version 12 does us no good if months ago version 12 entered end of life, leading the vendor to stop patching it.

In a similar vein, we need to plan for the expiration of SSL certificates. These certificates verify our web traffic and make encryption possible. When a certificate expires, web clients indicate the connection is insecure. Not only could this harm the reputation of our services, continuing to operate with expired certificates could accustom our users to ignoring security warnings.
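
One way to plan for expiration is a scheduled check that warns well before the date arrives. The sketch below uses Python's standard `ssl` module; the 30-day warning window is an assumption of mine and should be tuned to how long renewal takes in practice.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` (timezone-aware) and a certificate's notAfter timestamp.

    `not_after` uses the format returned by `getpeercert()`,
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now.timestamp()) / 86400


def fetch_not_after(host: str, port: int = 443) -> str:
    """Read the notAfter field of the certificate a server presents.

    An already-expired certificate fails verification during the handshake
    and raises ssl.SSLCertVerificationError, which is itself worth alerting on.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def needs_renewal(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """True when the certificate expires within `warn_days` (an assumed window)."""
    return days_until_expiry(not_after, now) <= warn_days
```

A scheduled job could call `needs_renewal(fetch_not_after(host), datetime.now(timezone.utc))` for each public hostname and raise an alert when it returns True.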

Having a strategy for protection is the first step, but equally important is monitoring for attacks. Monitoring should be a combination of:

  • Collecting and alerting on metrics specific to our services
  • Employing off-the-shelf products that monitor and protect against more general attacks

Off-the-shelf products include firewalls and distributed denial of service (DDoS) protection. Firewalls limit access to our network and alert on suspicious traffic, while DDoS protection drops nefarious requests, keeping them from overwhelming our services. Monitoring identifies attack vectors adversaries are specifically targeting and provides early notification should a breach occur. This allows us to deploy mitigation measures to limit damage.

Protection and monitoring reduce the attack surface, but the sheer number of attack vectors makes it difficult to stop a determined adversary. A final aspect of a security strategy is planning for and simulating our response to a security breach. This includes restricting services to read-only access or taking them offline, restoring from backups, and communicating with internal and external stakeholders. While there will be unique circumstances surrounding a particular security breach, having a plan in place for the foreseeable aspects will help us make better decisions during what can be a stressful situation.

Checklist

  • Familiarize the engineering team with the OWASP Top 10
  • Establish a security team to develop, execute, and adapt a security strategy
  • Schedule recurring security audits
  • Periodically employ a third-party security auditor
  • Consider leveraging a third-party identity service for authentication and authorization
  • Establish a regular cadence for patching infrastructure and software dependencies
  • Monitor infrastructure and software dependencies for patches of critical vulnerabilities
  • Identify end-of-life schedules for infrastructure and software dependencies and make plans to migrate to supported versions in advance of the expiration dates
  • Identify expiration of SSL certificates and make plans to replace them in advance of the expiration dates
  • Monitor service-specific metrics and automatically alert when issues arise
  • Leverage off-the-shelf firewall and DDoS products
  • Establish a plan for responding to foreseeable security breaches and simulate plan execution

Backup and disaster recovery

Hardware failures, human error, malicious actors, and weather events are common causes of lost data. We can develop excellent software, infrastructure, and documentation, but it does us no good if we lose them with no backup. In the event we do experience data loss, having a backup is not enough—we need a disaster recovery plan to minimize downtime, limiting the disruption to our employees and customers.

When developing a disaster recovery plan, we need to cover the following concepts:

  • Recovery time objective (RTO) is the time required to recover normal operations after suffering an outage (IBM Cloud Education 2018). The RTO varies between businesses and could vary for the same business at different stages in its development. For instance, a startup with few employees and no customers may be willing to deal with a longer RTO if its engineers can continue to work on their local workstations during an outage. Later, when the organization has hundreds of employees and paying customers, it will need to invest additional resources to bring down its RTO as the opportunity cost of downtime has grown.
  • Recovery point objective (RPO) defines how much data we are willing to lose in an outage (IBM Cloud Education 2018). We can determine this metric in terms of the volume of data lost or the time during which a service appears to be accepting data but is actually losing it. For eCommerce customers, losing customer orders is usually worse than simply not accepting orders.
  • Failover is the process of automatically transitioning workloads to redundant systems in a way that is seamless to users (IBM Cloud Education 2018).
  • Restore is the process of recovering backed-up data on the primary system (IBM Cloud Education 2018).
  • Failback is the process of returning workloads to the original systems after we restore them (IBM Cloud Education 2018).
  • Geography allows us to diversify risk but also affects RTO and RPO. Storing backups and failover systems in a different geographic area from the production system mitigates risk through the reduced likelihood of a power, weather, or political disruption affecting multiple locations. However, greater distance between sites typically reduces the speed at which we can transfer data for backup and restore. If the redundant site is physically distant from our users, they may experience increased latency while running workloads on the failover site.
  • Hardware uniformity between primary and failover infrastructure increases the likelihood of simultaneous failures. If the hardware running both systems is the same model and age, the time at which each fails could be highly correlated. Even so, we should weigh the increased operational complexity of managing less uniform infrastructure against the benefits of decorrelating failure events.
  • Service provider outages affect us when their disaster becomes our disaster. The services we depend on have their own backup and disaster recovery plans which affect the service standards they agree to provide us, commonly referred to as service level agreements (SLAs). Outsourcing workloads to service providers allows us to leverage economies of scale and expertise we may not have in-house. That said, even when their SLAs are superior to what we could provide, we must acknowledge the possibility of third-party disruptions by factoring their SLAs into our RTO and RPO metrics.

If the concepts above are more than the project team wants to manage, we can consider adopting disaster recovery as a service (DRaaS). DRaaS involves a third party hosting and managing the backup and disaster recovery infrastructure. By leveraging services from a cloud provider, we can maintain a high degree of control over the backup and disaster recovery process without having to build and manage the infrastructure ourselves. Taking it a step further, there are managed service providers who will manage both the process and the infrastructure. It is common for organizations to start out using DRaaS and transition to direct management as the organization grows.

Regardless of who operates the infrastructure and process for backup and disaster recovery, we must regularly test our plan to ensure its effectiveness. Having a plan does nothing if our backups silently fail for weeks before a disaster occurs. Even when using DRaaS, our project team is the most motivated party for ensuring the backup and disaster recovery plan will work as expected. We should schedule regular tests, adjusting the frequency based on the operational cost of executing a test relative to the cost of experiencing data loss.
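
A simple automated check can catch silently failing backups between full recovery tests. The sketch below compares the age of the last successful backup against the RPO. How the timestamp is obtained depends on the backup tooling, so it is passed in as a parameter here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backup_violates_rpo(
    last_successful_backup: Optional[datetime],
    now: datetime,
    rpo: timedelta,
) -> bool:
    """True when restoring right now would lose more data than the RPO allows."""
    if last_successful_backup is None:
        return True  # No backup has ever succeeded.
    return now - last_successful_backup > rpo


if __name__ == "__main__":
    # Hypothetical values: a four-hour RPO and a backup that last succeeded six hours ago.
    now = datetime.now(timezone.utc)
    print(backup_violates_rpo(now - timedelta(hours=6), now, timedelta(hours=4)))
```

This check only confirms a backup ran recently. It does not prove the backup can be restored, so it complements scheduled restore tests rather than replacing them.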

Checklist

  • Define the recovery time objective and recovery point objective for each service. This includes services that make up the product as well as services the organization depends on, such as documentation and source control management.
  • Define a plan for backing up data from each service.
  • Define a plan for recovering from an outage of each service. Be sure to consider cascading failures between inter-related services.
  • Determine whether to host backup and disaster recovery infrastructure on premise or in the cloud. This infrastructure should not exist at the same location as the production infrastructure.
  • Schedule regular tests of the backup and recovery process.

Performance testing

Performance testing evaluates how the product performs under peak load. The first step in performance testing is determining what level of load each service should support. While it would be nice if each service scaled infinitely, scaling comes at a development and operational cost, so it is necessary to right-size the cost of scaling to our load expectations.

Performance needs affect almost every aspect of the software stack. Workloads that require high levels of runtime and memory performance could dictate C/C++ as the programming language as opposed to languages with automatic memory management, such as Python or C#. Big data workloads require partition-tolerant databases and a choice between consistency and high availability (i.e., CAP theorem). Transactional data requires additional management in a distributed system (e.g., ACID compliance). In-memory data stores often have different data models than application databases, which have different data models than data warehouses. Time-series data commonly requires downsampling for queries to remain performant. The request load a given service instance must handle dictates infrastructure requirements as well as load balancing, concurrency, and more. Each of these performance considerations entails enough complexity to be its own specialty.

To avoid getting bogged down with optimizations and their associated complexities, we must strike a balance between speed of development and robustness of our system. The first step is to define our current performance needs and forecast how demand may change in the future.

When starting a new project, teams commonly develop a prototype and give limited consideration to performance and other aspects of durability (e.g., testing). This strategy assumes the prototype will test a hypothesis. If the hypothesis proves true, we will throw away the prototype, taking the lessons learned building it to plan and develop a robust system. However, it is common for the rebuild phase to never occur, with the team instead putting the prototype directly into production. This inevitably leads to countless Band-Aid fixes until the system becomes an unreliable, inextensible mess.

When building a service, I recommend implementing it with at least the minimum level of performance we are comfortable having in production. This preempts creating a leviathan and keeps the team in the habit of adhering to development standards, including performance. Strong teams usually have members whose responsibility, at least in part, is testing against performance standards, so implementing with performance in mind also ensures we utilize our team’s resources.

Having established performance standards and implemented at least a portion of a service, it is time to test against the standards. As with any form of testing, we want to automate as much as possible. Automation frees our personnel to work on non-repetitive, value-added tasks and enables us to continuously enforce performance standards as part of our CI/CD pipeline.
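
As one possible shape for an automated check, the sketch below treats a performance standard as a latency budget at a given percentile and evaluates measured samples against it. The 95th percentile and the nearest-rank method are assumptions I chose for illustration; the load-generation step that produces the samples is out of scope.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_latency_budget(latencies_ms, budget_ms, pct=95):
    """True when the chosen percentile of measured latencies is within budget."""
    return percentile(latencies_ms, pct) <= budget_ms
```

A pipeline stage could fail the build whenever `meets_latency_budget` returns False, which turns the performance standard into something the pipeline enforces rather than a document people must remember to consult.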

The final step in performance testing is measuring performance and alerting when we detect degradation. The simplest strategy generates an alert when performance falls below a predefined threshold. More sophisticated strategies perpetually measure performance to establish a moving average and alert when performance deviates. By integrating performance into the CI/CD pipeline, we catch commits that degrade performance before they impact our users.
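
The two alerting strategies can be sketched as follows. I assume the metric is a latency in milliseconds, so higher values mean worse performance, and that deviation is expressed as a fraction of the moving average. The window size and tolerance are illustrative.

```python
from collections import deque


def threshold_alert(value, limit):
    """Simplest strategy: alert when the metric exceeds a fixed limit."""
    return value > limit


class MovingAverageAlert:
    """Alert when a new sample deviates from the average of the last `window` samples."""

    def __init__(self, window, tolerance):
        self.samples = deque(maxlen=window)
        self.tolerance = tolerance  # Allowed deviation as a fraction, e.g. 0.5 for 50%.

    def observe(self, value):
        alert = False
        # Only compare once the window is full, so early samples cannot trigger alerts.
        if len(self.samples) == self.samples.maxlen:
            baseline = sum(self.samples) / len(self.samples)
            alert = abs(value - baseline) > self.tolerance * baseline
        self.samples.append(value)
        return alert


if __name__ == "__main__":
    monitor = MovingAverageAlert(window=3, tolerance=0.5)
    for latency_ms in [100, 100, 100, 110, 200]:
        print(latency_ms, monitor.observe(latency_ms))
```

The fixed threshold is easy to reason about but needs manual retuning as the service changes. The moving average adapts on its own, with the tradeoff that a slow, steady degradation can shift the baseline without ever triggering an alert.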

Checklist

  • Define current and future performance needs for each service
  • Define performance standards
  • Enforce performance standards in the CI/CD pipeline
  • Measure performance and alert on deviations (see section on logging)
  • Establish a regular interval for reevaluating performance needs and standards

Sources

Amazon Web Services. (2022). What is Source Control? https://aws.amazon.com/devops/source-control/

Cartar, K. (June 28, 2019). A Comprehensive Guide to Managing Web Development Projects. Hackernoon. https://hackernoon.com/a-comprehensive-guide-to-managing-web-development-projects-8364f2230eb7

Fanguy, W. (June 24, 2019). A comprehensive guide to design systems. InVision. https://www.invisionapp.com/inside-design/guide-to-design-systems/

Guilizzoni, P. (n.d.). What are wireframes. Retrieved July 15, 2022, from https://balsamiq.com/learn/articles/what-are-wireframes/

IBM Cloud Education. (December 6, 2018). Backup and Disaster Recovery. IBM. https://www.ibm.com/cloud/learn/backup-disaster-recovery

Interaction Design. (n.d.). User Experience (UX) Design. Retrieved July 8, 2022, from https://www.interaction-design.org/literature/topics/ux-design

Rehkopf, M. (n.d.). User stories with examples and a template. Retrieved July 11, 2022, from https://www.atlassian.com/agile/project-management/user-stories

Splunk. (2022). Logging best practices in an app or add-on for Splunk Enterprise. https://dev.splunk.com/enterprise/docs/developapps/addsupport/logging/loggingbestpractices/

Snap-on. (n.d.). Snap-on History Timeline. Retrieved July 14, 2022, from https://www.snapon.com/EN/Our-Company/Our-History

The University of Chicago Library. (May 9, 2011). Firmness, Commodity, and Delight. https://www.lib.uchicago.edu/collex/exhibits/firmness-commodity-and-delight/
