Challenges Encountered in PR-Driven Workflows for Terraform Collaboration

GitHub PR Automation for Terraform - Challenges

Terraform has gained significant popularity over the last few years. With increased adoption among SRE, platform, and DevOps practitioners, many teams have encountered challenges in setting up effective PR-driven workflows for Terraform collaboration. In this article, we explore some of the pain points that arise when collaborating on Terraform in a PR-driven workflow, and emphasize the importance of planning your workflow tooling in advance.

Since Terraform is stateful, effective state management is a pivotal consideration. Terraform's "plan-apply" workflow looks straightforward when demonstrated on a single machine. However, as teams grow and responsibilities are distributed among team members, managing the state file becomes increasingly difficult. This file records the most recent known status of the cloud resources Terraform manages, and it has to be updated every time those resources change.

To mitigate the risks associated with Terraform's stateful nature, many teams adopt a pull request-driven workflow. Pull requests trigger plan and apply actions against multiple workspaces. However, this approach introduces challenges of its own: competing plans, the need for serialised applies, the need for parallelism (without which build times can slow down by hours), and user management/RBAC (who can do what?).

Let’s go over each one in depth now:

Competing Plans:

In a typical collaborative environment, different team members or contributors may be working on separate branches or feature branches, each containing Terraform configurations representing changes to the infrastructure. When these branches are turned into PRs, Terraform generates an execution plan for each PR independently.

The problem arises when multiple PRs change the same or overlapping infrastructure resources, whether by creating, modifying, or deleting resources or by updating resource attributes. Each PR's execution plan is computed from the state of the infrastructure at the time the plan ran, plus the changes proposed in that PR; it knows nothing about the other open PRs. When multiple PRs are merged in rapid succession or simultaneously, their applies are triggered at nearly the same time, and each one tries to carry out a plan that may already be out of date because another PR has changed the same resources. These competing plans can lead to conflicts and unpredictability. For example:

Resource State Conflicts - If two PRs try to modify the same resource, Terraform may not be able to reconcile which change should take precedence, leading to conflicts.

Resource Deletion Issues - If one PR deletes a resource while another PR attempts to modify it, Terraform might fail to identify the correct sequence of actions, potentially leaving the resource in an inconsistent state.

Race Conditions - Concurrent execution of plans can result in race conditions where the order of operations isn't well-defined, causing unexpected outcomes.
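To make the stale-plan problem concrete, here is a minimal Python sketch. It is a toy model, not Terraform's implementation: `State`, `Plan`, `make_plan`, and `apply_plan` are hypothetical names. It borrows one real idea, the `serial` number carried in a Terraform state file, and assumes for simplicity that it goes up by exactly one on every apply.

```python
from dataclasses import dataclass


@dataclass
class State:
    """Toy stand-in for a Terraform state file (hypothetical, simplified)."""
    serial: int = 0  # assumed to increase by one on every successful apply


@dataclass(frozen=True)
class Plan:
    """A plan remembers which state snapshot it was computed from."""
    pr: str
    based_on_serial: int


class StalePlanError(Exception):
    """Raised when the state has moved on since the plan was created."""


def make_plan(pr: str, state: State) -> Plan:
    # Record the snapshot the plan was computed against.
    return Plan(pr=pr, based_on_serial=state.serial)


def apply_plan(plan: Plan, state: State) -> None:
    # Refuse to apply a plan computed against an older snapshot.
    if plan.based_on_serial != state.serial:
        raise StalePlanError(
            f"{plan.pr} was planned at serial {plan.based_on_serial}, "
            f"but the state is now at serial {state.serial}"
        )
    state.serial += 1  # the apply changed the state
```

If two PRs plan against the same snapshot and the first one applies, calling `apply_plan` for the second raises `StalePlanError` rather than applying changes on top of a state it never saw. In this model the fix is to re-plan the second PR after the first has been applied.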

To address competing plans in Terraform collaboration workflows, teams often serialize the execution of plans, establishing rules and mechanisms that ensure only one plan is applied at a time. Common approaches include manual coordination, locking mechanisms that prevent multiple PRs from applying their plans concurrently, and sequential execution, where each PR's plan is applied only after the previous one has completed successfully.
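A locking mechanism of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular tool does it: the `ProjectLocks` class and its methods are hypothetical, and the lock table lives in memory, whereas a real CI setup would need a shared, durable store so that every pipeline run sees the same locks.

```python
from typing import Dict, Optional


class ProjectLocks:
    """In-memory PR-level locks, one per Terraform project (hypothetical)."""

    def __init__(self) -> None:
        # Maps a project name to the PR number that currently holds its lock.
        self._holder: Dict[str, int] = {}

    def acquire(self, project: str, pr: int) -> bool:
        """Return True if `pr` holds the lock on `project` after this call."""
        # setdefault stores `pr` only when nobody holds the lock yet.
        return self._holder.setdefault(project, pr) == pr

    def release(self, project: str, pr: int) -> None:
        """Release the lock, but only if `pr` is the current holder."""
        if self._holder.get(project) == pr:
            del self._holder[project]

    def holder(self, project: str) -> Optional[int]:
        """Return the PR number holding the lock, or None if it is free."""
        return self._holder.get(project)
```

A pipeline would call `acquire` before planning or applying and refuse to continue when it returns False; the lock would typically be released when the PR is merged or closed. Locks are per project, so PRs that touch unrelated projects do not block each other.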

The need for serialised applies:

In a collaborative environment, different team members work on various Terraform configurations and open separate PRs to propose changes, and these PRs can modify the same or overlapping resources. If multiple PRs are merged and trigger Terraform apply simultaneously, there is a risk of concurrent operations on the same resources, which can lead to conflicts, unpredictable behavior, and errors during the apply. To mitigate these risks, teams often serialize applies: only one PR is allowed to trigger Terraform apply at a time, so changes land sequentially rather than concurrently.
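The effect of serializing applies can be sketched with a mutex in Python. This is a single-process illustration only: `SerializedApplier` is a hypothetical name, threads stand in for CI jobs, and `time.sleep` stands in for the real apply. Across separate CI runners the same idea would need a lock that all runners share.

```python
import threading
import time
from typing import List


class SerializedApplier:
    """Lets only one apply run at a time (hypothetical sketch)."""

    def __init__(self) -> None:
        self._apply_lock = threading.Lock()
        self.active = 0          # applies currently in progress
        self.max_active = 0      # highest concurrency ever observed
        self.applied: List[str] = []

    def apply(self, pr: str) -> None:
        # Every caller waits here until the previous apply has finished.
        with self._apply_lock:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
            time.sleep(0.01)  # stand-in for the real `terraform apply`
            self.applied.append(pr)
            self.active -= 1
```

Even when several threads call `apply` at once, `max_active` never exceeds one. Note that a plain lock guarantees mutual exclusion but not ordering; if applies must happen in merge order, a queue is the more fitting structure.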

The (dire) need for parallelism

Say an organization or team has a significant number of AWS accounts, with each account representing a distinct project or environment. These accounts might be used for different purposes, such as development, testing, staging, and production. For each AWS account or project, there is a dedicated Terraform repository or project. These repositories contain Terraform configurations that define the infrastructure resources to be provisioned within the corresponding AWS account. When a change is made to the infrastructure code, it often needs to be applied to all AWS accounts. This could be due to the need to roll out a common configuration change, security update, or any other infrastructure adjustment.

The challenge arises because most workflows are configured to execute Terraform plans and applies sequentially across all AWS accounts. In other words, they process one account at a time, waiting for the plan and apply in one account to complete before moving on to the next. With a growing number of AWS accounts and potentially complex infrastructure configurations, this sequential execution becomes a performance bottleneck, leading to longer deployment times, slower response to changes, and increased maintenance overhead. The number of AWS accounts and associated Terraform projects also keeps growing, which makes the bottleneck worse over time.

To address these challenges, the team can reconfigure the workflow to run Terraform plan and apply operations across multiple AWS accounts in parallel instead of processing the accounts one at a time. They can also optimise their CI/CD pipelines to use parallelism and automation so that Terraform projects in different AWS accounts are managed efficiently.
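As a rough sketch of the parallel approach, the Python below fans a plan out over several project directories with a thread pool. The assumptions are ours: one directory per AWS account, `terraform init` already run in each, credentials for each account already configured, and `plan_all` and `terraform_plan` as hypothetical helper names. The runner is passed in as a parameter so the fan-out logic can be exercised without Terraform installed.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def terraform_plan(directory: str) -> int:
    """Run `terraform plan` in `directory` and return its exit code.

    Assumes `terraform init` has already been run in that directory.
    """
    cmd = ["terraform", f"-chdir={directory}", "plan", "-input=false"]
    return subprocess.run(cmd, check=False).returncode


def plan_all(
    directories: List[str],
    run: Callable[[str], int] = terraform_plan,
    max_workers: int = 8,
) -> Dict[str, int]:
    """Plan every directory concurrently; map each directory to its exit code."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as `directories`.
        codes = list(pool.map(run, directories))
    return dict(zip(directories, codes))
```

Running accounts in parallel is reasonable here because each account has its own project and state. Changes that target the same state still need to be serialized, so parallelism across projects and serialization within a project complement each other.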

Who can do what? (AKA RBAC in the enterprise world)

Role-Based Access Control (RBAC) in Terraform is essential for managing multiple projects and environments within repositories. RBAC enables fine-grained access control to ensure only authorized users or teams can propose and apply changes to specific environments. For example, in a Dev environment, anyone on the team may propose and apply changes freely, while in QA, changes can be proposed by anyone but require oversight for application. In the Prod environment, applying changes may be restricted to a designated team or authorized individuals. Implementing RBAC involves integrating an Identity Provider (IdP), defining roles, creating access policies, configuring repository settings, establishing pull request workflows that enforce RBAC, setting up notifications and alerts and maintaining audit trails.
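The Dev, QA, and Prod rules described above can be written down as a small policy table. The Python below is a sketch, not a prescribed design: `EnvPolicy`, `POLICIES`, `can_plan`, and `can_apply` are hypothetical names, the `platform` team is an invented example, and requiring approval in Prod as well as QA is our assumption rather than something stated above.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class EnvPolicy:
    apply_teams: Optional[FrozenSet[str]]  # None means any team may apply
    needs_approval: bool                   # apply requires an approved PR


# Hypothetical policy table mirroring the Dev / QA / Prod example.
POLICIES: Dict[str, EnvPolicy] = {
    "dev": EnvPolicy(apply_teams=None, needs_approval=False),
    "qa": EnvPolicy(apply_teams=None, needs_approval=True),
    # Assumption: Prod also requires approval; "platform" is an invented team.
    "prod": EnvPolicy(apply_teams=frozenset({"platform"}), needs_approval=True),
}


def can_plan(env: str, team: str) -> bool:
    """Anyone may propose changes to a known environment."""
    return env in POLICIES


def can_apply(env: str, team: str, approved: bool) -> bool:
    """Check team membership first, then the approval requirement."""
    policy = POLICIES.get(env)
    if policy is None:
        return False  # unknown environments are denied by default
    if policy.apply_teams is not None and team not in policy.apply_teams:
        return False
    return approved or not policy.needs_approval
```

For example, `can_apply("qa", "payments", approved=False)` is False until the PR has been approved, while `can_apply("prod", "payments", approved=True)` stays False because only the designated team may apply there. Denying unknown environments by default keeps a typo in an environment name from silently granting access.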

In a PR-driven Terraform workflow, RBAC can be established to regulate access to different environments within a repository. This is achieved by combining the capabilities of version control systems (VCS) like GitHub or GitLab with Terraform-specific practices.

Access permissions in the VCS platform are configured to specify who can initiate pull requests (PRs), open branches, and merge code. PR templates and guidelines provide contributors with clear instructions on RBAC policies. Automated checks in the CI/CD pipeline examine PR code changes to ensure they align with RBAC requirements.
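One building block of such an automated check is working out which environments a PR touches. The Python sketch below assumes a hypothetical repository layout of `envs/<environment>/...`; the `touched_environments` function and the `KNOWN_ENVS` set are illustrative names, not part of any tool.

```python
from pathlib import PurePosixPath
from typing import Iterable, Set

# Hypothetical environment names; adjust to the repository's own layout.
KNOWN_ENVS = {"dev", "qa", "prod"}


def touched_environments(changed_files: Iterable[str], root: str = "envs") -> Set[str]:
    """Return the environments whose files appear in a PR's changed-file list."""
    envs: Set[str] = set()
    for name in changed_files:
        parts = PurePosixPath(name).parts
        # Expect paths shaped like envs/<environment>/<anything>.
        if len(parts) >= 3 and parts[0] == root and parts[1] in KNOWN_ENVS:
            envs.add(parts[1])
    return envs
```

A pipeline step could feed the PR's changed-file list into this function and then require the matching approvals for each environment returned. The sketch ignores files outside the `envs` directory, so changes to shared modules that affect several environments would need separate handling.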

Code reviews with designated approvers and RBAC configurations defined within the Terraform codebase further enforce RBAC policies. Policy as Code (PaC) tools can be employed to codify and automate RBAC rule enforcement. Additionally, notifications, alerts, testing environments, and audit logs contribute to the RBAC workflow's transparency, security, and accountability, allowing for well-regulated access control in a PR-centric Terraform environment.

This article was written by Digger. Digger is an Open Source IaC orchestrator that enables you to run Terraform & OpenTofu within your CI/CD securely. Do check out the repo here or join our Slack!