Building quality software involves blending business, engineering, and artistic expertise to create a product. Beautifully-written code is not enough. We need a process for organizing the team, identifying the job to be done, and designing, implementing, and maintaining the product.
Having created, extended, and maintained several software products over the past decade, I have identified some common characteristics required to deliver quality software. In the sections that follow, I provide an overview of each, along with a checklist of the minimum set of actions required to be successful. As you embark on your software project, I hope this reference will put you on the right path and provide a helpful list of action items to refer to along the way.
Project management is the process of planning, executing, and reviewing progress towards a stated goal. We undertake a project to benefit stakeholders, the people whose interests a project serves. To complete a project, we require a team of people with the skills and expertise necessary to achieve the project goal.
Embarking on a project demands resources. These resources may be personnel, such as user experience designers and software engineers, property, such as hardware and software, or intangible goods, such as time and influence. Most projects have a resource budget which leads us to economize resources. The project management process aims to effectively utilize resources to finish a project within its budget.
Projects are typically led by a project manager. The project manager has responsibility to (Cartar 2019):
Planning a project up front and executing the plan is one way to approach a project, but there are other valid strategies. A project management methodology offers strategies and procedures for project planning and execution. By adopting a methodology, a project team becomes more autonomous because the methodology expresses the project manager’s intent without requiring specific directives. This frees the project manager from overseeing every detail of the project, thereby alleviating a bottleneck towards progress.
The chosen methodology should align with the project’s scope, complexity, and team. For small, simple projects, planning up front and executing may be all that is needed. For large, complex projects, it may not be possible to define everything ahead of time, so an iterative methodology that allows for adjustments is more appropriate. The experience, habits, and values of the team are critical considerations when selecting a methodology. People, not methodologies, carry out the project plan, so the team must understand and buy into the chosen methodology for the project to be successful.
While the project manager is responsible for a project’s success, large projects require delegating responsibilities across a team. To facilitate delegation, teams often employ software tools. These tools document, delegate, measure, and report on the many aspects of a project. They often give each team member a perspective of the project appropriate to his role. When choosing tools, it is important to select ones that fit the team’s methodology.
User experience (UX) design seeks to meet users’ needs in the context in which they use a product (Interaction Design 2022). These contexts include onboarding, troubleshooting, and offboarding, in addition to core functionality. While good UX is in the eye of the beholder, there are universally recognized aspects of good design which the UX design process helps incorporate into the products we build.
Well-designed products offer durability, usefulness, and aesthetics—qualities Roman architect Vitruvius Pollio identified more than 2,000 years ago (The University of Chicago Library 2011). In the context of software, durability equates to scalability, security, error handling, and alerting. Usefulness conveys itself in the form of efficient interfaces and intuitive features that allow the user to accomplish something. Aesthetics impart style, proportion, and visual beauty on the product, giving it an emotional appeal or distinctive identity in users’ minds. To design these qualities into a product, we must determine a product’s why, what, and how.
The why consists of users’ motivations for adopting a product (Interaction Design 2022). This starts with a job the user needs done. Typically, the marketplace has multiple products capable of accomplishing the same job. Which product a user chooses is based on costs relative to benefits. Costs include money and time spent learning and using a product. Benefits include accomplishing the job to be done, time saved, and how the values associated with using a product reflect on the user.
The what consists of a product’s functionality (Interaction Design 2022). To appeal to users, a product must satisfy a job its users need done. The job could be physical (such as digging a hole or satiating hunger) or cognitive (such as analyzing data or providing entertainment). Some products accomplish multiple jobs simultaneously. For instance, Google’s reCAPTCHA software protects websites from bots while also training artificial intelligence models to annotate images and digitize text.
The how is the way a product goes about accomplishing the job to be done. This consists of a product’s aesthetics and whether it is accessible to its target users (Interaction Design 2022). Intertwined with the why and what, the how differentiates products from one another.
To light a room, we can use a standard fluorescent light fixture: it accomplishes the why—seeing more clearly—and the what—artificially lighting a room. However, how fluorescent lighting goes about its job is generally not considered aesthetically pleasing. LED light fixtures, an alternative to fluorescent fixtures, accomplish the same why and what, but are flexibly shaped, capable of producing a more natural and dimmable light, and can tolerate freezing temperatures. These aspects make LEDs more aesthetically pleasing and accessible in a wider variety of environments, differentiating them from fluorescent lighting.
To economize resources, it may be tempting to give short shrift to the how of user experience. Satisfying functional needs is critical, but ignoring aesthetics can diminish the value a product delivers. Business software is often more utilitarian than digital consumer products. But for both types of software, it is people who use them, and how people go about their work affects the perceived value of the work they do. If a business gives its employees clumsy software, employees may interpret their work as unimportant because the business did not invest in quality tools.
Between the why and how, someone must define what the product does. In agile project management methodologies, it is common to interview users to develop user stories: a few sentences outlining a particular task the user wants to accomplish (Rehkopf 2022). The project manager uses input from stakeholders to prioritize stories, and the development team turns the highest-priority stories into tasks for implementation.
User stories ensure the product solves an actual problem for a real user. Without defining user stories, it is surprisingly easy to expend significant effort on features that do not have a definitive beneficiary.
Atlassian, a developer of project management software, recommends the following process for writing user stories (Rehkopf 2022):
To get started writing user stories, it helps to have a template. Atlassian (Rehkopf 2022) recommends a simple sentence structure: “As a [persona], I [want to], [so that].” Breaking this template down:

- [persona]: the type of user the story serves—who we are building for.
- [want to]: the user’s intent—what the user wants to accomplish, not the feature that accomplishes it.
- [so that]: the benefit—how completing the task fits into the user’s bigger picture.
On some projects, the project manager interviews stakeholders and writes user stories; on others it is the UX team; on still others, it is business analysts who work directly with users and communicate between the project management and UX teams. Regardless of team structure, defining user stories is essential for establishing the inputs of why, what, and how required to deliver good UX.
We have compiled quite a bit of information thus far: user stories; why, what, and how; characteristics that make our product durable, usable, and aesthetically pleasing. To capture this information and enable collaboration, we need a place to store it. For the smallest projects, we can use text files and spreadsheets. For products of any size and complexity, we need a documentation tool. This tool should offer versioning, real-time collaboration, notifications, comments, search, tables, charts, and templates for organizing information. Such a tool enables us to inform the team, make decisions, and remember why we made certain decisions.
Having documented the motivation behind our product, the obvious next step is creating a prototype. However, before thinking about implementation, we should establish a design system: clear standards that guide the creation and usage of modular components (Fanguy 2019). A design system is about more than colors and font sizes—it explains why and how to use each component and ensures different components fit together with a shared look and feel.
Think of a design system as a physical toolbox: when we encounter a job to be done, we go to our toolbox to select one or more tools to tackle the job. Each tool comes with a set of instructions. The instructions explain why each tool exists and how to use it. It may include examples of using a tool in combination with other tools. Just as manufacturer Snap-on created a set of modular ratchets with the slogan “Five to do the work of Fifty” (Snap-on 2022), the standards of our design system should lead to modular components we can combine for use in a variety of scenarios.
At first, our toolbox may be sparsely outfitted—we may even have to use the blunt side of a screwdriver as a hammer. Over time, we can add more and specialized tools, allowing us to accomplish jobs more effectively. This, in turn, enables us to deliver better user experience.
When it comes to adding tools to our toolbox, it is not necessary to build everything ourselves. One way to accelerate initial product development is adopting an existing design system. Chances are, there are some challenges of building our product others have already faced. Mature design systems and their associated component libraries offer modular components for solving commonly faced problems.
Whether we leverage an existing design system or create one from scratch, we must capture its design standards in our documentation tool. For individual components, screenshots of each component adjacent to descriptions of why the component exists and how to use it will get the job done. However, it is best to use an interactive tool to bring together live components, examples, and documentation in one place. Storybook allows anyone on the team to visualize and interact with the entire design system toolbox, complete with instructions alongside each tool.
A design system not only encourages a unified look and feel, it also brings predictability to prototyping and development efforts. Because we outfit our toolbox up front, prototyping consists of modularly combining existing components rather than imagining everything from a blank canvas. To turn prototypes into code, developers reach for pre-built components, increasing the certainty with which they estimate implementation time. We should have thoroughly tested these components when adding them to the design system, so quality assurance and bug fixing cycles should become shorter as well.
With a design system in place and user stories in hand, it is time to prototype the product. In the context of user interfaces, it is common to create wireframes: a set of blueprints that help the team think and communicate about the structure of the software it is building (Guilizzoni 2022). Whether for an entirely new product or a new feature of an existing product, we create wireframes before writing any code.
Historically, a distinguishing feature of wireframes was their low fidelity. Crudely sketching a view’s structure in black and white allows for quickly incorporating feedback during multiple wireframe iterations. Once the team arrives at a structure it likes, it adds details, such as color, animations, and precise spacing—a process known as visual design.
Assuming we have a design system in place, we collapse the wireframe and visual design processes into one step. Modern design platforms enable us to drag and drop copies of our design system components onto a prototyping canvas. We already defined the visual design of each component when establishing our design system, so there is no additional cost to adding the full-featured component to our prototype. Figma, a design platform, allows creating interactive prototypes, complete with click-through navigation, animations, and multi-layer overlays.
So far, we have established that UX is a multidisciplinary, qualitative process. To bring things together in a slightly more quantitative way, we refer to Peter Morville, author of several best-selling UX books. Morville identifies seven factors that equate to good UX (Interaction Design 2022):
Assigning a weight to each of the first six factors and then rating a product against them, we arrive at a value score. By rating products according to the same factors, we have a means for comparing the UX of products inside and outside our organization.
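As a sketch of this scoring approach, the function below computes a weighted average across the six factors. The factor names follow Morville’s model; the specific weights and ratings are illustrative assumptions, not recommended values.

```javascript
// Weighted UX value score: sum(weight * rating) / sum(weight).
// Weights express how much each factor matters to our users;
// ratings express how well a product delivers on that factor.
function valueScore(weights, ratings) {
  let total = 0;
  let weightSum = 0;
  for (const factor of Object.keys(weights)) {
    total += weights[factor] * ratings[factor];
    weightSum += weights[factor];
  }
  return total / weightSum;
}

// Hypothetical weights and ratings on a 1-10 scale.
const weights = { useful: 3, usable: 3, findable: 2, credible: 2, desirable: 1, accessible: 1 };
const productA = { useful: 9, usable: 7, findable: 8, credible: 9, desirable: 6, accessible: 7 };

console.log(valueScore(weights, productA).toFixed(2)); // → "7.92"
```

Scoring competing products with the same weights and factors gives the team a consistent basis for comparison.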
An integrated development environment (IDE) is a tool for writing and modifying source code. Commonly referred to as an editor, an IDE offers syntax highlighting, auto-completion, real-time validation, debugging, contribution history, automated refactoring, and conveniences for running terminal commands, among other efficiencies. My preferred IDE is Visual Studio Code (VS Code), a lightweight editor with extensions for enhancing its core functionality.
According to Amazon Web Services (2022):
Source control is the practice of tracking and maintaining changes to code. Source control management (SCM) systems provide a running history of code development and help to resolve conflicts when merging contributions from multiple sources.
An SCM enables collaboration by providing tools to review and track changes to code. Git is a widely used SCM in which developers pull down a complete copy of a repository onto their individual machines. Git refers to a copy of the code as a branch. One branch serves as the trunk or master copy branch. From this branch, developers create additional copies of the code (i.e., branches) in which to make their changes. A collection of branches holding different variations of the same code is known as a repository. When a developer is ready to share his changes, he publishes his branch to a remote repository for review. Once the team approves the changes, the developer merges his changes into the trunk branch.
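The branch-and-merge cycle described above can be sketched with a few git commands. This assumes git 2.28 or later (for `init -b`); the branch and file names are illustrative.

```shell
# Illustrative branch-and-merge cycle in a throwaway repository.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q -b trunk
git config user.email "dev@example.com"
git config user.name "Dev"
git commit -q --allow-empty -m "initial commit"

# Create a branch, make a change, and commit it.
git checkout -q -b feature/login-form
echo "login form" > login.txt
git add login.txt
git commit -q -m "add login form"

# After review approval, merge the changes into the trunk branch.
git checkout -q trunk
git merge -q --no-ff feature/login-form -m "merge feature/login-form"
```

In practice, the feature branch is pushed to a remote repository for review before the final merge.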
There are many branching strategies for developing, testing, and deploying code. My preference is the trunk-based workflow, where the trunk branch is ready to deploy at all times. To develop features or fix bugs, a developer creates a short-lived branch to which only he commits. Once the team approves the changes to the branch, the developer merges them to the trunk, at which point a continuous integration/continuous deployment (CI/CD) tool verifies the changes and deploys the new version of the product. The short-lived nature of branches reduces the potential for merge conflicts, and continuous deployment fits the software-as-a-service (SaaS) paradigm. Because some features cannot be developed within a day or two, we use strategies such as branch by abstraction or feature flags to enable us to merge incomplete features into the trunk branch, thereby keeping the life of branches short.
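A feature flag can be as simple as a boolean check around the incomplete code path. The sketch below shows one minimal approach; the flag name, environment variable, and checkout functions are hypothetical.

```javascript
// Minimal feature-flag sketch: an incomplete feature merges to trunk but
// stays hidden until the flag flips on.
function loadFlags(env) {
  // In production, flags might come from env vars or a flag service.
  return { newCheckoutFlow: env.ENABLE_NEW_CHECKOUT === 'true' };
}

function renderCheckout(flags) {
  if (flags.newCheckoutFlow) {
    return 'new checkout'; // work in progress, off by default
  }
  return 'legacy checkout';
}

console.log(renderCheckout(loadFlags(process.env)));
```

Because the new path is unreachable unless the flag is enabled, developers can merge partial work to the trunk daily without exposing it to users.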
An alternative strategy is the Gitflow workflow. This workflow defines a strict branching model designed around project releases. The master branch stores the official release history, with each commit tagged with a version. A development branch serves as an integration branch for fully completed features. Feature branches exist until the feature is complete and interact only with the development branch.
When preparing for a release, we fork off the development branch and commit fixes to the release branch until it becomes stable, at which point we merge the release branch into both the master and development branches. If we need a hotfix for a released version of the product, we create a hotfix branch from master, commit the fixes, and merge the hotfix into the master branch. If the fix also applies to the development branch, we merge the hotfix branch into the development branch.
Due to the strict nature and complexity of Gitflow, it comes with a CLI tool to help the team adhere to the workflow’s rules. The Gitflow strategy is useful for managing large projects supporting multiple versions of a product simultaneously. It is not typically used in continuous deployment scenarios due to long-lived feature branches and an extensive release process.
Logging events allows us to analyze errors, track performance, identify threats, and gain insights. When we identify a bug, logs allow us to recreate the state that led to the error. By measuring performance, we can identify when modifications to code, configuration, or infrastructure impact user experience. Similarly, recording user behavior enables us to prioritize our work, identify deviations that indicate malicious behavior, and improve marketing efforts.
The software industry commonly discusses logging in terms of monitoring, alerting, and observability. Monitoring consists of gathering predefined logs and metrics. Alerting notifies interested parties when the information the system monitors deviates from set thresholds. Observability combines monitoring and alerting, giving teams the ability to debug their systems.
Log analysis software provides sorting, filtering, and alerting capabilities to make sense of our logs. We aim to avoid monitoring unnecessary data because it obscures the information we want to uncover and is costly to sift through. That said, we do want logs of relevant data, and log analysis software enables logging potentially relevant events without having to manually review a sea of log statements. This improves observability.
Each log statement should contain metadata in addition to its payload. The format should be human readable, developer friendly, and widely supported (e.g., JSON) to avoid vendor lock-in and facilitate forensic analysis in the event of a security breach. When possible, limit the payload to a single line (approximately 80-256 characters) to improve indexing, querying, and disk compression (Splunk 2022).
Metadata
Payload
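A log statement meeting these guidelines might look like the sketch below: a single line of JSON combining metadata with an event-specific payload. The field names are illustrative, not a standard schema.

```javascript
// Emit a single-line JSON log entry: metadata plus payload.
function logEvent(level, service, message, payload) {
  const entry = {
    timestamp: new Date().toISOString(), // metadata: when
    level,                               // metadata: severity
    service,                             // metadata: emitting service
    message,                             // human-readable summary
    ...payload,                          // event-specific payload
  };
  const line = JSON.stringify(entry);    // one line per event
  console.log(line);
  return line;
}

logEvent('error', 'checkout', 'payment declined', { orderId: 'A-1001', code: 402 });
```

Because the output is machine-parseable JSON, any log analysis tool can index, filter, and alert on these fields without vendor-specific formats.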
Deciding what to log and how to analyze it is the primary focus, but we also need a strategy for storing logs. Begin by writing logs to local storage. Local storage provides a buffer in the event of network latency or failure (Splunk 2022). From local storage, send logs to a service that stores them in a repository. Depending on our system design and log analysis software, we may have a centralized repository or a repository per service. Either way, we need strategies for handling writes during periods of peak throughput and for provisioning additional storage automatically to avoid running out of space. We also need plans for backing up and restoring logs. These processes ensure the information we log is available to analyze.
Linting is a form of static code analysis that identifies syntax errors, style issues, and possible bugs. Most linters have a configuration file where we define which code we want to analyze and what kind of analysis we want to perform. We may run multiple linters on the same repository.
Each programming language has its own syntax, and linters alert us to problems before we attempt to compile or interpret the code. Software teams should define a style guide to ensure the code they write is consistent and understandable by the entire team. Linters alleviate the team from having to police adherence to style guidelines by automatically detecting code that does not respect the established guidelines. Advanced linters can detect potential bugs, such as memory leaks and infinite loops. Some linters even automatically fix problems.
There is an axiom that code written without test coverage immediately becomes legacy code. Developers are loath to change legacy code for fear they overlook a hidden requirement, causing the application to stop behaving as expected. Tests document requirements and give us confidence our code works. With automated tests in place, we have the freedom to refactor and extend code and quickly receive feedback as to whether it still works after our changes.
Tooling is an integral part of test coverage. Test runners provide statistics, such as the percentage of lines covered with tests. Testing frameworks provide convenience functions that make writing tests easier. However, even if the tooling reports 100 percent test coverage, this does not necessarily mean we captured all requirements. There may be edge cases, such as extremely large values, that cause our code to break even if we have a test verifying the same line of code works with smaller values.
We should capture all requirements by writing automated tests at the time we write the code. This ensures the requirements are fresh in our minds when we capture them with test coverage. When deadlines approach, it is tempting to put off writing tests for later; however, this is exactly the time we most need test coverage. When stress is high and attention to detail short, mistakes multiply. Good discipline with test coverage helps keep mistakes under control, limiting the number of bugs developers write into the code.
There are three primary types of tests: unit, integration, and end-to-end. Unit tests are the smallest and fastest, verifying a particular unit of code works as expected. Integration tests combine two or more units of code, verifying we get the expected behavior when using them together. End-to-end tests bring up a full system and verify a given business process from start to finish.
Most test coverage should exist in the form of unit tests. If the units of code making up an integration or end-to-end test do not pass, these higher-level tests should not pass either. Because unit tests verify the smallest chunks of code, they are the fastest to run and simplest to maintain. Create integration and end-to-end tests judiciously, as they are less resilient to change and take more time to set up and run. Running tests gives us confidence our software is working, but we must balance the time it takes to write, maintain, and run tests against development and deployment velocity.
Tests are only valuable if we run them. Tooling should automatically run tests whenever a developer makes a commit, preventing the commit if the tests fail. Tests should also run whenever a developer pushes a commit to a remote branch; if any tests fail, the server tooling should notify the developer and mark the commit as failed. We might economize by running only unit tests when a developer makes a commit, while running integration and end-to-end tests, in addition to unit tests, when a developer merges to the trunk.
When software does not work as expected, we need a way to quickly identify the cause of the problem (i.e., the bug). This process, known as debugging, may be as simple as logging function calls and state to the standard output log, or it may use tools to attach to the software process, allowing us to pause function execution to examine state at demarcated breakpoints. Debugging tools allow us to shorten the time between the identification of a bug and the time we have a fix ready to deploy.
VS Code has a JavaScript debugger for setting breakpoints, stepping through functions, and examining state. To configure it, we add an entry to the .vscode/launch.json file. This debugger even supports conditional breakpoints, which activate only if the state matches our condition.
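A minimal launch.json entry for debugging a Node.js script might look like the following sketch; the configuration name and program path are illustrative.

```json
{
  "version": "0.2.0",
  "configurations": [
    {
      "type": "node",
      "request": "launch",
      "name": "Debug server",
      "program": "${workspaceFolder}/server.js"
    }
  ]
}
```

With this entry in place, pressing F5 launches the program under the debugger, stopping at any breakpoints we set in the editor.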
Even the best developers make mistakes. Establishing a code review process enhances software quality by checking for things the author overlooked. This might be covering an edge case (even with 100 percent test coverage), pointing out a bug, or suggesting a better design.
Code review improves developers in addition to the software they write. Developers exchange ideas during code review, thereby leveraging the expertise of their teammates. This allows reviewers to pick up new techniques from reviewing others’ code and authors to receive feedback on areas to improve. Ideally, both parties will leverage what they learn to write better code in the future.
Code review is a form of feedback, and there are a few guidelines to keep in mind:
The guidelines above also apply to the person receiving feedback. When others take time to review our code, we should assume they have our best interest in mind, as well as that of the product. While it may be frustrating to find our code needs work, we should diligently implement valid suggestions. When we disagree, we should do so respectfully, clearly explaining our thoughts.
Tooling is vital for code review. Tools indicate what changed, provide the ability to add comments on a line-by-line basis, and enable workflows for approving or denying changes. Many tools integrate with the build system, allowing us to see whether artifacts built successfully, tests passed, and test coverage thresholds were met. When the team approves changes, tooling can merge the code with the click of a button.
It may be tempting to skip code reviews when timelines get tight. However, this is when we most need multiple sets of eyes ensuring we maintain high code quality. Stress can affect attention to detail, leading to bugs and brittle code. Be sure to consistently adhere to the code review guidelines even when it feels like time is short. It costs far less to catch and fix a bug in code review than it does in production.
A picture is worth a thousand words, and code review is no exception. When reviewing changes to user interface code, being able to see a live deployment can be as helpful as reviewing the code itself. Using a continuous integration pipeline, we can automatically deploy a test instance each time a developer initiates a code review. For reviewing reusable components, tools like Storybook and its cloud companion Chromatic give us visual diff tools to see what changed. These tools allow less technical team members to participate in code review by providing feedback on changes before they make it into staging or production.
Third-party libraries accelerate development by providing high-quality features we do not need to build ourselves. They can also inhibit maintainability when library maintainers do not quickly address bugs or support new operating systems, browsers, and frameworks. Often, libraries come with parts we do not use, and the design of each library dictates whether we can keep unused code from weighing down our bundle size.
Continuous integration (CI) and continuous deployment (CD) are strategies for automating parts of the software development process. CI automates development processes while CD automates operations processes. Automation occurs in stages, with each stage operating on the output from a previous stage, similar to a shell command operating on output from a previous command via the pipe operator. For this reason, the industry refers to a collection of stages as the CI/CD pipeline.
CI encourages developers to frequently merge code and alerts them of problems. When CI detects a new development branch, it automatically lints the code, runs tests, and generates a build artifact. Should anything fail, CI notifies the author of the issue, usually with a link to logs for debugging the problem. By automating the code integration process, CI gives us confidence our code does not break things.
Without CI, teams typically resort to a “merge day” strategy. On merge day, all developers working on a particular repository meet to merge their code, resolve merge conflicts, and test the resulting code. Since these meetings pause development, they tend to occur infrequently. This strategy increases the likelihood of conflicts and, by extension, mistakes resolving conflicts.
To eliminate merge days, we first define coding standards. CI enforces development standards through automation—without standards in place, there is nothing to enforce. Because CI enforces our standards, we allow developers to continually merge their changes after receiving code review approval. Merging more often reduces the probability of merge conflicts by reducing the time between developers pulling the latest code and merging in their changes.
For CI to work, we must configure linting, write tests, and configure builds to run directly from the command line. In addition, each stage must run quickly so as not to deter developers from frequently pushing their changes. Having met these requirements, we leverage CI tools to establish stages in the integration pipeline, monitor for changes, collect logs, and report the success or failure of each stage.
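Once each stage runs from the command line, wiring them into a pipeline is straightforward. The sketch below uses GitHub Actions syntax; the job name and npm script names are assumptions about the project’s setup.

```yaml
# Hypothetical CI workflow: lint, test, and build on every push.
name: ci
on: [push]
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 18
      - run: npm ci          # install exact locked dependencies
      - run: npm run lint    # linting stage
      - run: npm test        # test stage
      - run: npm run build   # build stage
```

If any stage fails, the pipeline stops, marks the commit as failed, and notifies the author with a link to the logs.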
CD automates the release of integrated code. CI is a prerequisite to CD, because we only want to deploy validated code. The CD process starts as soon as a new commit hits the trunk or master branch. By automating the deployment process, we avoid the operations team becoming a bottleneck in the release process.
The first stage in CD is initializing the execution environment. We define the environment in code using Docker images. By defining the execution environment in code (e.g., Dockerfiles), we eliminate the one-by-one configuration of physical environments. Instead, the Docker runtime stamps out instances of the execution environment via Docker images. An instance of a Docker image is known as a Docker container. Docker containers isolate the execution environment from the host machine on which our environment runs, increasing predictability and opportunities for automation.
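A Dockerfile for a small Node.js service might look like the sketch below; the base image, port, and entry point are illustrative assumptions.

```dockerfile
# Hypothetical execution environment for a Node.js service.
FROM node:18-alpine
WORKDIR /app

# Install dependencies first so Docker can cache this layer.
COPY package*.json ./
RUN npm ci --omit=dev

# Copy the application code and define how to run it.
COPY . .
EXPOSE 3000
CMD ["node", "server.js"]
```

Because the environment is declared in code, every container stamped from this image is identical, whether it runs on a developer’s laptop or in production.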
The amount of orchestration required to deploy code differs between libraries and services/applications. For libraries, the deployment process may involve just two stages: initializing the execution environment and running the publish command. For services, we need to replace an existing release with a new one in an automated fashion. Orchestrating this update requires additional tooling.
Kubernetes (K8s) is an orchestration platform for deploying, scaling, and managing Docker containers (as well as other container technologies). I will focus on automating updates, but K8s has numerous other features for orchestrating containerized workloads.
K8s offers load balancing, giving us the ability to run multiple instances of our services. When it comes time to release an update, we configure K8s to terminate one container of the service we want to update while we spin up a new container holding our updated code. Once K8s detects the new container is ready, it notifies the load balancer to direct traffic to the new container. K8s repeats this process until all instances of the workload are running the new version. (In K8s parlance, each container instance runs inside a pod, the smallest deployable unit.)
The update process described above glosses over numerous complexities. Whether the service holds state and whether the updated service API is forward and backward compatible determines the extent to which we can automate the update without disrupting users. The update process I described is known as a rolling update strategy. There are alternative strategies, such as blue/green deployment, that allow K8s to switch from one version to another all at once. Which strategy is most effective depends on the nature of each service.
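A rolling update is declared on the workload itself. The Deployment sketch below shows the relevant fields; the service name, image tag, and port are illustrative assumptions.

```yaml
# Hypothetical Deployment configured for a rolling update.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-service
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # take down at most one pod at a time
      maxSurge: 1         # allow one extra pod during the update
  selector:
    matchLabels:
      app: web-service
  template:
    metadata:
      labels:
        app: web-service
    spec:
      containers:
        - name: web-service
          image: registry.example.com/web-service:2.0.0
          readinessProbe:       # K8s waits for this before shifting traffic
            httpGet:
              path: /healthz
              port: 3000
```

Changing the image tag and applying the manifest triggers the rolling update; the readiness probe tells K8s when each new pod is safe to receive traffic.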
Instead of deploying directly to production, it is common to first deploy to testing and staging environments. In the testing environment, stakeholders test the new version, indicating whether they accept or decline the changes. We already ran automated tests during the CI process, so this testing is a manual process. Once stakeholders accept the changes, we configure the CD pipeline to deploy to a staging environment, giving the team one last chance to verify the service is ready for production. In sophisticated CD pipelines, the approval process is part of the pipeline, allowing us to automate the deployment of the new version to staging and production as soon as stakeholders sign off.
As a final step in the CD pipeline, some teams run smoke tests: basic checks on key functionality of the updated service. The motivation behind smoke tests is to verify a new deployment is working as expected and quickly roll back if the tests fail. Rollback is yet another feature K8s can orchestrate automatically.
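A minimal smoke test might probe a few key endpoints after deployment. The sketch below assumes a hypothetical `/healthz` path and treats any non-200 response or connection error as failure:

```python
import urllib.request

def smoke_test(base_url, paths=("/healthz",), timeout=5):
    """Return True if every key endpoint responds with HTTP 200."""
    for path in paths:
        try:
            with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
                if resp.status != 200:
                    return False
        except OSError:  # covers connection errors and HTTP error statuses
            return False
    return True
```

If `smoke_test` returns False, the CD pipeline can trigger an automatic rollback rather than leaving a broken version in production.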
Exposing web services over a network disseminates their benefits to our user base, but it also exposes us to attack. To keep our services secure, we need to develop a security strategy for mitigating commonly exploited attack vectors.
The security landscape is constantly evolving, so I will not attempt to document each attack vector here. As a whole, the engineering team should familiarize itself with the Open Web Application Security Project (OWASP) Top 10 security risks. OWASP maintains descriptions, example scenarios, and mitigation measures for the most critical security risks to web services. Within the engineering team, we should dedicate specific personnel to develop and adapt our security strategy. This security team should audit our services against risks on a scheduled interval and make recommendations for addressing any security holes it identifies.
In addition to in-house security audits, teams should consider periodic third-party security audits. Professional security auditors see a variety of environments and should be familiar with both common and exotic vulnerabilities. Before the audit, we need to decide what level of testing we want the auditor to perform.
By leveraging a third-party auditor, we bring a fresh set of eyes and additional expertise.
A significant way to enhance the security of web services is adopting a third-party identity service (i.e., authentication and authorization service). While outsourcing this critical aspect of security is a risk, there are several established vendors whose services undergo regular security audits to ensure their products are secure against known exploits. Most authentication and authorization protocols have numerous steps and rules that require strict implementation. Relying on a vendor who specializes in these protocols is a better option than enlisting a few engineers who dabble in them. While I generally avoid recommending third-party services that create vendor lock-in, the cost of a security breach justifies the tradeoff.
A key practice for enhancing security is keeping infrastructure and software dependencies patched with the latest updates. This basic yet powerful step protects our services from known vulnerabilities. The work involved with updating versions may not provide benefits in terms of user-facing features, so teams often put off this effort until a critical vulnerability arises, at which point the effort to update may be quite large. To avoid this situation, we should establish a regular cadence for patching dependencies. We also need to monitor for patches of critical vulnerabilities that require immediate attention.
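As a sketch of the idea, a script can compare installed dependency versions against the latest published versions and flag those that lag behind. The version data here is illustrative; in practice a package manager or vulnerability scanner supplies it:

```python
def parse_version(version):
    """Turn '2.28.1' into a comparable tuple (2, 28, 1).

    Comparing tuples of integers handles cases like 2.9 vs 2.10
    correctly, where plain string comparison would not.
    """
    return tuple(int(part) for part in version.split("."))

def outdated(installed, latest):
    """Names of dependencies whose installed version lags the latest release."""
    return [name for name, version in installed.items()
            if parse_version(version) < parse_version(latest[name])]

# 'requests' lags behind the latest release; 'flask' is current.
stale = outdated({"requests": "2.25.0", "flask": "2.1.2"},
                 {"requests": "2.28.1", "flask": "2.1.2"})
```

Running a check like this on a schedule, and alerting on its output, turns patching from an ad hoc scramble into routine maintenance.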
Software vendors typically establish end-of-life dates for their products, after which they will no longer provide security patches. We need a plan for ensuring we migrate to a supported version of each product before reaching its end-of-life date. Having the latest patch for version 12 does us no good if months ago version 12 entered end of life, leading the vendor to stop patching it.
In a similar vein, we need to plan for the expiration of SSL/TLS certificates. These certificates verify the identity of our servers and make encrypted connections possible. When a certificate expires, web clients warn users that the connection is insecure. Not only could this harm the reputation of our services, but continuing to operate with expired certificates could also accustom our users to ignoring security warnings.
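Certificate expiry is easy to monitor programmatically. The sketch below uses Python's standard `ssl` module to convert a certificate's `notAfter` field into days remaining; the 30-day alert threshold mentioned afterward is an arbitrary example:

```python
import ssl
import time

def days_until_expiry(not_after, now=None):
    """Days remaining before a certificate expires.

    `not_after` is the certificate's notAfter field in the format
    returned by ssl.SSLSocket.getpeercert(), e.g. 'Jun 01 12:00:00 2025 GMT'.
    """
    expiry = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expiry - now) / 86400  # 86400 seconds per day
```

A scheduled job could call this for each certificate and alert when the result drops below, say, 30 days, leaving ample time to renew.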
Having a strategy for protection is the first step, but equally important is monitoring for attacks. Monitoring should combine off-the-shelf security products with custom monitoring tailored to our services.
Off-the-shelf products include firewalls and distributed denial of service (DDoS) protection. Firewalls limit access to our network and alert on suspicious traffic, while DDoS protection drops nefarious requests, keeping them from overwhelming our services. Monitoring identifies attack vectors adversaries are specifically targeting and provides early notification should a breach occur. This allows us to deploy mitigation measures to limit damage.
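To illustrate one mechanism such products rely on, the sketch below implements a token-bucket rate limiter that drops requests exceeding a configured rate. This is a simplification for illustration; real DDoS protection operates at far larger scale and across many more signals:

```python
import time

class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilling at `rate`
    tokens per second; requests that find no token are dropped."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = capacity
        self.last = clock()

    def allow(self):
        """Return True if the request may proceed, False if it is dropped."""
        now = self.clock()
        # Refill tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The injectable `clock` parameter makes the limiter easy to test deterministically without real waiting.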
Protection and monitoring reduce the attack surface, but the sheer number of attack vectors makes it difficult to stop a determined adversary. A final aspect of a security strategy is planning for and simulating our response to a security breach. This includes restricting services to read-only access or taking them offline, restoring from backups, and communicating with internal and external stakeholders. While there will be unique circumstances surrounding a particular security breach, having a plan in place for the foreseeable aspects will help us make better decisions during what can be a stressful situation.
Hardware failures, human error, malicious actors, and weather events are common causes of lost data. We can develop excellent software, infrastructure, and documentation, but it does us no good if we lose them with no backup. In the event we do experience data loss, having a backup is not enough—we need a disaster recovery plan to minimize downtime, limiting the disruption to our employees and customers.
When developing a disaster recovery plan, we need to cover several concepts: which data and systems to back up and how often, how quickly we must restore service (the recovery time objective), and how much recent data we can afford to lose (the recovery point objective).
If the concepts above are more than the project team wants to manage, we can consider adopting disaster recovery as a service (DRaaS). DRaaS involves a third party hosting and managing the backup and disaster recovery infrastructure. By leveraging services from a cloud provider, we can maintain a high degree of control over the backup and disaster recovery process without having to build and manage the infrastructure ourselves. Taking it a step further, there are managed service providers who will manage both the process and the infrastructure. It is common for organizations to start out using DRaaS and transition to direct management as the organization grows.
Regardless of who operates the infrastructure and process for backup and disaster recovery, we must regularly test our plan to ensure its effectiveness. Having a plan does nothing if our backups silently fail for weeks before a disaster occurs. Even when using DRaaS, our project team is the most motivated party for ensuring the backup and disaster recovery plan will work as expected. We should schedule regular tests, adjusting the frequency based on the operational cost of executing a test relative to the cost of experiencing data loss.
Performance testing evaluates how the product performs under peak load. The first step in performance testing is determining what level of load each service should support. While it would be nice if every service scaled infinitely, scaling comes at a development and operational cost, so it is necessary to right-size the cost of scaling to our load expectations.
Performance needs affect almost every aspect of the software stack. Workloads that require high levels of runtime and memory performance could dictate C/C++ as the programming language as opposed to languages with automatic memory management, such as Python or C#. Big data workloads require partition-tolerant databases and a choice between consistency and high availability (i.e., CAP theorem). Transactional data requires additional management in a distributed system (e.g., ACID compliance). In-memory data stores often have different data models than application databases, which have different data models than data warehouses. Time-series data commonly requires downsampling for queries to remain performant. The request load a given service instance must handle dictates infrastructure requirements as well as load balancing, concurrency, and more. Each of these performance considerations entails enough complexity to be its own specialty.
To avoid getting bogged down with optimizations and their associated complexities, we must strike a balance between speed of development and robustness of our system. The first step is to define our current performance needs and forecast how demand may change in the future.
When starting a new project, teams commonly develop a prototype and give limited consideration to performance and other aspects of durability (e.g., testing). This strategy assumes the prototype will test a hypothesis. If the hypothesis proves true, we will throw away the prototype, taking the lessons learned building it to plan and develop a robust system. However, it is common for the rebuild phase to never occur, instead putting the prototype directly into production. This inevitably leads to countless Band-Aid fixes until the system becomes an unreliable, inextensible mess.
When building a service, I recommend implementing it with at least the minimum level of performance we are comfortable having in production. This preempts creating a leviathan and keeps the team in the habit of adhering to development standards, including performance. Strong teams usually have members whose responsibility, at least in part, is testing against performance standards, so implementing with performance in mind also ensures we utilize our team’s resources.
Having established performance standards and implemented at least a portion of a service, it is time to test against the standards. As with any form of testing, we want to automate as much as possible. Automation frees our personnel to work on non-repetitive, value-added tasks and enables us to continuously enforce performance standards as part of our CI/CD pipeline.
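As an illustration, a CI step might measure an operation's 95th-percentile latency and fail the build when it exceeds the team's standard. The threshold and the operation under test are assumptions for the sketch:

```python
import statistics
import time

def p95_latency(operation, runs=100):
    """Run `operation` repeatedly and return its 95th-percentile latency in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append(time.perf_counter() - start)
    # With n=20, the last cut point is the 95th percentile.
    return statistics.quantiles(samples, n=20)[-1]

def enforce_standard(operation, threshold_seconds):
    """Raise (failing the CI job) when latency exceeds the performance standard."""
    latency = p95_latency(operation)
    if latency > threshold_seconds:
        raise AssertionError(
            f"p95 latency {latency:.4f}s exceeds {threshold_seconds}s standard")
```

Using a high percentile rather than the mean keeps occasional slow requests, which users notice most, from hiding behind a healthy average.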
The final step in performance testing is measuring performance and alerting when we detect degradation. The simplest strategy generates an alert when performance falls below a predefined threshold. More sophisticated strategies perpetually measure performance to establish a moving average and alert when performance deviates. By integrating performance into the CI/CD pipeline, we catch commits that degrade performance before they impact our users.
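The moving-average strategy can be sketched with an exponential moving average that flags samples exceeding the average by a tolerance factor. Both parameters below are illustrative and would be tuned per service:

```python
class DegradationDetector:
    """Track an exponential moving average of a latency metric and
    flag samples that exceed it by a tolerance factor."""

    def __init__(self, alpha=0.1, tolerance=1.5):
        self.alpha, self.tolerance = alpha, tolerance
        self.average = None

    def observe(self, sample):
        """Record a new measurement; return True if it indicates degradation."""
        if self.average is None:
            self.average = sample  # first sample seeds the baseline
            return False
        degraded = sample > self.average * self.tolerance
        # Fold the new sample into the moving average.
        self.average = (1 - self.alpha) * self.average + self.alpha * sample
        return degraded
```

Because the baseline moves with the data, the detector adapts to gradual, expected drift while still catching sudden regressions such as a bad deploy.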
Amazon Web Services. (2022). What is Source Control? https://aws.amazon.com/devops/source-control/
Cartar, K. (June 28, 2019). A Comprehensive Guide to Managing Web Development Projects. Hackernoon. https://hackernoon.com/a-comprehensive-guide-to-managing-web-development-projects-8364f2230eb7
Fanguy, W. (June 24, 2019). A comprehensive guide to design systems. InVision. https://www.invisionapp.com/inside-design/guide-to-design-systems/
Guilizzoni, P. (n.d.). What are wireframes. Retrieved July 15, 2022, from https://balsamiq.com/learn/articles/what-are-wireframes/
IBM Cloud Education. (December 6, 2018). Backup and Disaster Recovery. IBM. https://www.ibm.com/cloud/learn/backup-disaster-recovery
Interaction Design. (n.d.). User Experience (UX) Design. Retrieved July 8, 2022, from https://www.interaction-design.org/literature/topics/ux-design
Rehkopf, M. (n.d.). User stories with examples and a template. Retrieved July 11, 2022, from https://www.atlassian.com/agile/project-management/user-stories
Splunk. (2022). Logging best practices in an app or add-on for Splunk Enterprise. https://dev.splunk.com/enterprise/docs/developapps/addsupport/logging/loggingbestpractices/
Snap-on. (n.d.). Snap-on History Timeline. Retrieved July 14, 2022, from https://www.snapon.com/EN/Our-Company/Our-History
The University of Chicago Library. (May 9, 2011). Firmness, Commodity, and Delight. https://www.lib.uchicago.edu/collex/exhibits/firmness-commodity-and-delight/