rtyler

Noodling on Otto’s pipeline state machine

2020-11-21T00:00:00+00:00

Recently I have been making good progress with Otto such that I seem to be unearthing one challenging design problem per week. The sketches of Otto pipeline syntax necessitated some internal data structure changes to ensure that the right level of flexibility was present for execution. Otto is designed as a services-oriented architecture, and I have the parser service and the agent daemon which will execute steps from a pipeline. I must now implement the service(s) between the parsing of a pipeline and the execution of said pipeline. My current thinking is that two services are needed: the Orchestrator and the Pipeline State Machine.

For this blog post I won’t discuss much of what the Orchestrator should do, other than to mention that I intend Orchestrators to exist to provision resources and launch agents for executing pipelines.

The Pipeline State Machine (PSM) is where the real fun starts. Somewhere inside Otto, something must keep track of the progression of a pipeline from one state to another, ensuring that the right actions are being triggered when certain state transitions occur.

States

The current structure of the internal pipeline model informs the potential states in the state machine:

---
uuid: 'some'
batches:
  - mode: Linear
    contexts:
      - uuid: 'uuid-context'
        properties:
          name: 'Build'
        environment: {}
        steps:
          - symbol: 'sh'
            uuid: 'uuid-step'
            context: 'uuid-context'
            parameters:
              - 'pwd'
  - mode: Linear
    contexts:
      - uuid: 'uuid2-context'
        properties:
          name: 'Test'
        environment: {}
        steps:
          - symbol: 'sh'
            uuid: 'uuid2-step'
            context: 'uuid2-context'
            parameters:
              - 'make test'

“Batches” are a concept which will exist internally to Otto to help handle parallel stages and other novel groupings of steps. Referring back to the “sketches of syntax” post, a parallel or fanout block would result in a single Batch. Inside that Batch would be a Context for each stage declared, allowing some flexibility between the internal representation of a Pipeline and the user-visible declaration.
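To make the Batch/Context relationship concrete, here is a rough Python sketch of the model above. The field names come from the YAML; the class names, the `Parallel` mode value, and the `batch_from_stages` helper are assumptions made for illustration and are not Otto's actual types.

```python
from dataclasses import dataclass, field
from typing import Dict, List
from uuid import uuid4


@dataclass
class Step:
    symbol: str              # e.g. 'sh'
    context: str             # uuid of the owning Context
    parameters: List[str]
    uuid: str = field(default_factory=lambda: str(uuid4()))


@dataclass
class Context:
    properties: Dict[str, str]
    environment: Dict[str, str] = field(default_factory=dict)
    steps: List[Step] = field(default_factory=list)
    uuid: str = field(default_factory=lambda: str(uuid4()))


@dataclass
class Batch:
    mode: str                # 'Linear' appears in the YAML; 'Parallel' is assumed
    contexts: List[Context] = field(default_factory=list)


def batch_from_stages(stages: Dict[str, List[str]], mode: str) -> Batch:
    """Hypothetical helper: one Context per declared stage, all in one Batch."""
    batch = Batch(mode=mode)
    for name, commands in stages.items():
        ctx = Context(properties={"name": name})
        ctx.steps = [
            Step(symbol="sh", context=ctx.uuid, parameters=[command])
            for command in commands
        ]
        batch.contexts.append(ctx)
    return batch


# A parallel block with two stages becomes a single Batch with two Contexts.
parallel = batch_from_stages(
    {"Lint": ["make lint"], "Test": ["make test"]}, mode="Parallel"
)
```

Each step carries the uuid of its Context, mirroring the `context:` key in the YAML.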

I believe that each Pipeline will largely need to progress through the states defined below.

Pending: requires full model and uuid, basically the output from the Parser.
for batch in batches
- Auction Started: with list of contexts that have been auctioned
- Auction Completed: with list of contexts and the winning Orchestrator for each context.
- Provisioning: mapping each context to an Orchestrator who should be provisioning the resource(s) necessary to execute that context.
- Execution(context): each context has its own state of: Pending/Running/Failed/Aborted/Unstable/Success
- Batch Complete
Pipeline Complete(status)

“Auctions” refer to the planned resource auctioning work I wish to explore at a later date; the first version of PSM will likely omit these states.
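The state list above can be sketched as a small transition table. This is only an illustration in Python with the Auction states left out, as a first version of PSM might do; the enum, the table, and the `walk` helper are assumptions rather than a settled design.

```python
from enum import Enum


class PipelineState(Enum):
    PENDING = "Pending"
    PROVISIONING = "Provisioning"
    EXECUTION = "Execution"
    BATCH_COMPLETE = "Batch Complete"
    PIPELINE_COMPLETE = "Pipeline Complete"


# Allowed transitions. After a batch completes, the pipeline either starts
# provisioning the next batch or finishes.
TRANSITIONS = {
    PipelineState.PENDING: {PipelineState.PROVISIONING},
    PipelineState.PROVISIONING: {PipelineState.EXECUTION},
    PipelineState.EXECUTION: {PipelineState.BATCH_COMPLETE},
    PipelineState.BATCH_COMPLETE: {
        PipelineState.PROVISIONING,
        PipelineState.PIPELINE_COMPLETE,
    },
    PipelineState.PIPELINE_COMPLETE: set(),
}


def can_transition(current: PipelineState, target: PipelineState) -> bool:
    return target in TRANSITIONS[current]


def walk(batch_count: int) -> list:
    """Return the sequence of states a pipeline with `batch_count` batches visits."""
    if batch_count < 1:
        raise ValueError("a pipeline needs at least one batch")
    path = [PipelineState.PENDING]
    per_batch = [
        PipelineState.PROVISIONING,
        PipelineState.EXECUTION,
        PipelineState.BATCH_COMPLETE,
    ]
    for target in per_batch * batch_count + [PipelineState.PIPELINE_COMPLETE]:
        if not can_transition(path[-1], target):
            raise ValueError(f"illegal transition to {target.value}")
        path.append(target)
    return path
```

The per-context states (Pending/Running/Failed/Aborted/Unstable/Success) would live inside `EXECUTION` and are not modeled here.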

The requirements for PSM I have in mind are:

It should receive and store the entire pipeline model (YAML above). I am not yet sure what the exact interplay between source control and PSM should be. I have an issue which mentions the service which will ingest GitHub Webhook payloads. My current thinking is that this service should perhaps be responsible for handling the webhook payload and fetching the Ottofile in order to send a request to the Parser and then PSM.
It should hold the mapping between a given pipeline uuid and the states listed above.
It should fire events for each state transition.
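A minimal in-memory sketch of those three requirements, in Python, might look like the following. The class, its method names, and the listener signature are all hypothetical; a real service would persist state and publish events over the network rather than call functions in-process.

```python
from typing import Callable, Dict, List, Tuple

# A listener receives (pipeline uuid, old state, new state).
Listener = Callable[[str, str, str], None]


class PipelineStateMachine:
    def __init__(self) -> None:
        self._models: Dict[str, dict] = {}
        self._states: Dict[str, str] = {}
        self._listeners: List[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def register(self, uuid: str, model: dict) -> None:
        """Store the full model and start the pipeline in 'Pending'."""
        self._models[uuid] = model
        self._states[uuid] = "Pending"

    def state_of(self, uuid: str) -> str:
        return self._states[uuid]

    def transition(self, uuid: str, new_state: str) -> None:
        """Record the new state and fire an event for the transition."""
        old_state = self._states[uuid]
        self._states[uuid] = new_state
        for listener in self._listeners:
            listener(uuid, old_state, new_state)


events: List[Tuple[str, str, str]] = []
psm = PipelineStateMachine()
psm.subscribe(lambda uuid, old, new: events.append((uuid, old, new)))
psm.register("some", {"uuid": "some", "batches": []})
psm.transition("some", "Provisioning")
```

This sketch does not check whether a transition is legal; that check would sit at the top of `transition`.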

Some requirements I am not yet certain of are:

Does it need to know which orchestrator or who is actually executing on a context or batch?
Should PSM contain a mapping of a commit revision to pipeline uuid to help with de-duplication of pipelines for identical commits?

Looking at the shape of PSM is like inspecting a building in the distance. I have a general idea of its dimensions and key characteristics, but the details remain blurry no matter how hard I squint.

This phase of Otto’s development has certainly been the most frustrating in months. I’m pushing towards enough service integration to allow Otto to perform basic self-hosted CI. To accomplish this I will need:

A service ingesting webhooks and fishing the Ottofile out of the latest commit on a given branch.
☑ The Parser service, which turns an Ottofile into a usable internal model.
A Pipeline State Machine to manage the execution of the pipeline.
A basic Orchestrator which can dispatch a local otto-agent with the appropriate arguments.
☑ An object store service to contain logs, artifacts, and step libraries.
☑ An agent capable of executing steps.
☑ Defined steps to check out a source repo.

For the most basic self-hosted implementation, I don’t even think I need a GUI/dashboard or an eventbus, both of which are in the “grand vision.”

Much of what remains requires “big think” time however, which is in short supply. Every time I sit down to the problem, I spend a non-trivial amount of time debating whether I am over-complicating this before I am able to re-convince myself of the approach I am taking here.

The curse of working in Jenkins for so long is that I know how so many CI system design decisions ultimately run into limitations for certain use-cases.

Regardless of how challenging the path ahead appears, I will inch along, slowly but surely. :)

As always, if you’re curious to learn more, you’re welcome to join #otto on the Freenode IRC network, or follow along on GitHub

Sketches of syntax, a pipeline for Otto

2020-11-06T00:00:00+00:00

Defining a good continuous integration and delivery pipeline syntax for Otto is one of the most important challenges in the entire project. It is one which I struggled with early in the project almost a year and a half ago. It is a challenge I continue to struggle with today, even as the puzzle pieces start to interlock for the multi-service system I originally imagined Otto to be. Now that I have started writing the parser, the pressure to make some design decisions and play them out to their logical ends is growing. The following snippet compiles to the current Otto intermediate representation and will execute on the current prototype agent implementation:

pipeline {
  stages {
    stage {
      name = 'Build'
      steps {
        echo '>> Building project'
        sh 'make all'
      }
    }
  }
}

“It works!”

The reaction upon sharing this with friends and colleagues on Twitter was largely: “looks like Declarative [Pipeline].” The syntax above is just a simple shim to get the basics working, but the similarities are no accident. Jenkins Pipeline is the best publicly available syntax for describing a CI/CD pipeline (in my not-so-humble opinion). If I didn’t believe that I wouldn’t continue to advocate for its use at every possible turn.

For Otto however, the goal is not to create a Jenkins Pipeline knock-off. In this post I wanted to share some sketches of what I think Otto Pipelines should look like and why. For starters, the README has an incomplete list covering some high-level goals that I have for modeling continuous delivery. The examples/ directory also has a few sample .otto files which I will use for reference throughout this post.

It may also be worth reading my previous post describing Step Libraries if you haven’t already, since they play an integral part in making this syntax “go.”

With the pre-requisites out of the way, let’s walk through some syntax!

Defining the available tools

use {
  stdlib
  './common/slack'
}

Sprinkled at the top of Otto Pipelines is the use block. A common problem with Jenkins Pipeline is that the steps available in the Jenkinsfile are completely dependent on what plugins have been installed on the controller or what Shared Libraries have been configured. The use block effectively brings Otto Step Libraries into scope for the given pipeline. Because Step Libraries require no “controller-side” execution in Otto, each Otto Pipeline can use a completely different set of steps for users to leverage in their workflow.

Open questions:

Versioning for step libraries seems like it is worth doing, but what’s the right syntax for expressing it?
Referring to step libraries by URL could be incredibly useful, but is it worth the complexity?

Defining the execution environment

Next is declaring the execution environment for a stage/stages:

/* snip */
stage {
  name = 'Build'

  runtime {
    docker {
      image = 'ruby:2.6'
    }
  }

  steps {
  }
}

Resource allocation is one of the areas I am most excited to explore with Otto, both from a modeling standpoint and an execution standpoint. Jenkins was built before “cloud” was a thing, and arguably before “containers”, depending on whether or not any rabid Solaris users are within earshot. As such it has some pitfalls when mapping pipeline execution to these much more dynamic environments. On the flip side, newer CI/CD systems seem to have all gravitated towards container-all-the-things and typically don’t consider non-container workloads in any form, and will also usually require a Kubernetes cluster just to get started.

runtime {
  arch = 'amd64'
  linux {
    pkgconfig = ['openssl', 'libxml-2.0']
  }
  python {
    version = '~> 3.8.5'
    virtualenv = true
  }
}

For Otto I want to do better, and I have started thinking about capabilities rather than fixed labels or names. In many cases, I don’t particularly care where my Rust project builds, just so long as it has cargo and an up-to-date stable rustc. Similarly for Python projects, I might need an execution environment with Python 3.x, virtualenv, and libxml2 installed. In most systems that precede Otto, administrators end up defining complex labels which users must know. Another way to paper over this complexity is to say “just bring your own container!” which pushes a lot of work back onto developers, typically leading to one-off Dockerfiles which just take an upstream image and add one or two dependencies.

With a capabilities-oriented model, the pipeline orchestration layer is no longer looking for machines labeled “linux-python” and then hoping one is available. Instead the orchestrator can be smarter and find any available capacity to meet the capabilities request. I believe this approach can improve on overall system performance and scheduling. An idea which I have floating around as a draft RFC right now is basically to “auction” pipeline tasks to the lowest bidder. When I first started considering this idea, I found this paper titled Efficient Nash Equilibrium Resource Allocation Based on Game Theory Mechanism in Cloud Computing by Using Auction, which will likely guide the implementation of the auctioneer quite a bit.
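As a rough illustration of the idea, the Python sketch below matches a set of required capabilities against bids and picks the cheapest eligible one. It is deliberately naive and is not the mechanism from the paper mentioned above; the `Bid` type, the capability strings, and the cost numbers are all invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Bid:
    orchestrator: str       # hypothetical name of the bidding Orchestrator
    capabilities: Set[str]  # what the offered resource provides
    cost: float             # lower is better


def run_auction(required: Set[str], bids: List[Bid]) -> Optional[Bid]:
    """Pick the cheapest bid whose capabilities cover everything required."""
    eligible = [bid for bid in bids if required <= bid.capabilities]
    if not eligible:
        return None
    return min(eligible, key=lambda bid: bid.cost)


bids = [
    Bid("big-vm", {"linux", "amd64", "python3", "virtualenv", "libxml2"}, cost=5.0),
    Bid("small-vm", {"linux", "amd64", "python3", "virtualenv", "libxml2"}, cost=2.0),
    Bid("rust-box", {"linux", "amd64", "cargo", "rustc"}, cost=1.0),
]
winner = run_auction({"python3", "virtualenv", "libxml2"}, bids)
```

Note that `rust-box` is the cheapest bid overall but is never considered, because it cannot satisfy the request. No label such as “linux-python” appears anywhere.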

What remains to be seen is whether users are actually interested in expressing the capabilities that would be necessary to make a highly efficient resource auction practical.

Open questions:

Do most developers think about what their pipeline needs in the same way I think about capabilities?
How would an administrator define capabilities of a cloud-based VM template?

Caching

From an operational standpoint, I think the most common problem of any CI system is overuse of remote resources by pipelines. This is not a niche problem, but rather something that affects practically everybody. Some will say “you should be caching and proxying all your remote resources!” which is simply not a practical solution for the vast majority of the users in the ecosystem. Many users won’t be at organizations large enough to deploy such caching proxies.

stage {
  name = 'Build'

  cache {
    // Create a cache named "gems", immutable for the rest of the Pipeline
    gems = ['vendor/']
    assets = ['dist/css/', 'dist/js/']
  }

  /* snip */
}
stage {
  name = 'Test'
  cache {
    use gems
  }
}

The cache block is intended to provide pipeline authors with a way to cache arbitrary sections of the workspace for later re-use in the pipeline across multiple agents. This is a pretty simple syntax addition, but something built into the Otto infrastructure from the beginning.

On the implementation side, this requires that archiving and retrieving these artifacts is relatively quick, which I don’t believe will be a major challenge.
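One way to picture the archive-and-restore half of this is the Python sketch below, which packs workspace-relative paths into a gzip-compressed tarball and unpacks them at the same relative location in a second workspace. The function names are hypothetical, and where the archive would be stored between agents (presumably the object store) is left out entirely.

```python
import io
import os
import tarfile
import tempfile


def archive_paths(workspace: str, paths: list) -> bytes:
    """Pack the given workspace-relative paths into an in-memory .tar.gz."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for relative in paths:
            relative = relative.rstrip("/")
            tar.add(os.path.join(workspace, relative), arcname=relative)
    return buffer.getvalue()


def restore_paths(workspace: str, archive: bytes) -> None:
    """Unpack a cache archive into the same relative locations elsewhere."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        tar.extractall(workspace)


# Two temporary directories stand in for the workspaces of two agents.
with tempfile.TemporaryDirectory() as build_ws, tempfile.TemporaryDirectory() as test_ws:
    os.makedirs(os.path.join(build_ws, "vendor"))
    with open(os.path.join(build_ws, "vendor", "gem.txt"), "w") as handle:
        handle.write("cached")

    gems = archive_paths(build_ws, ["vendor/"])
    restore_paths(test_ws, gems)

    with open(os.path.join(test_ws, "vendor", "gem.txt")) as handle:
        restored = handle.read()
```

A real implementation would also need to decide what happens when the target path already exists in the second workspace.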

Open questions:

Is it sufficient to cache a file subtree and simply restore it into the same location in another agent’s workspace?
Would this syntax accommodate the caching of Docker image layers?

Composition and Re-use

Inevitably developers try to abstract common functionality and behaviors into re-usable components. Step Libraries can provide one flavor of this re-usability, but I don’t believe that it is sufficient. The ubiquitous adoption of YAML by newer CI/CD tools led me to joke about the five stages of YAML wherein developers end up turning a declarative syntax into templates and then into just another Turing-complete language.

In Jenkins we have seen numerous tools for templatizing jobs, pipelines, or other aspects of Jenkins configuration. Suffice it to say, there is a need to compose and re-use various aspects of pipelines.

For Otto, I have been playing around with a context-aware from keyword, such as below:

stage {
  name = 'Test'
  runtime {
    from 'Build'
  }
  steps {
    sh 'bundle exec rake spec'
  }
}

In the above example, from instructs the pipeline to re-use the contents of the runtime block from the Build stage. My current thinking is that this simple use of from allows for pipeline-internal re-use of pieces without the need to set variables or turn this into a scripting language.
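To show how little machinery this form of re-use needs, here is a Python sketch that resolves `from` inside runtime blocks before execution. Stages are represented as plain dictionaries purely for the example, and the rule that a `from` may not point at another `from` is an assumption, not something Otto defines.

```python
import copy


def resolve_from(stages: list) -> list:
    """Replace {'from': '<stage name>'} runtime blocks with a copy of that stage's runtime."""
    by_name = {stage["name"]: stage for stage in stages}
    resolved = []
    for stage in stages:
        stage = copy.deepcopy(stage)
        runtime = stage.get("runtime", {})
        if "from" in runtime:
            source = by_name.get(runtime["from"])
            source_runtime = source.get("runtime") if source else None
            # Assumed rule: the referenced stage must define its runtime directly.
            if source_runtime is None or "from" in source_runtime:
                raise ValueError(f"cannot resolve runtime from {runtime['from']!r}")
            stage["runtime"] = copy.deepcopy(source_runtime)
        resolved.append(stage)
    return resolved


stages = [
    {"name": "Build", "runtime": {"docker": {"image": "ruby:2.6"}}, "steps": []},
    {"name": "Test", "runtime": {"from": "Build"}, "steps": [{"sh": "bundle exec rake spec"}]},
]
resolved = resolve_from(stages)
```

Because the substitution happens on a copy before anything runs, no variables or scripting are involved.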

That said, re-usability within the pipeline isn’t where the main interest in “templates” lies.

I have been exploring the concept of a “blueprint” which can act as a re-usable unit of Otto Pipeline. I am imagining that these would be published and managed similarly to Step Libraries. In order to provide maximum flexibility, I think blueprints should be able to capture just about any snippet of the Otto Pipeline syntax for re-use. Consider the following example to help make common Ruby gem build/test/publish pipelines cleaner:

rubygem.blueprint

use {
  stdlib
}

blueprint {
  parameters {
    rubyVersion {
      description = 'Specify the Ruby container'
      default = 'ruby'
      type = string
    }

    deploy {
      description = 'Push to rubygems.org'
      default = true
      type = boolean
    }
  }

  plan {
    stages {
      stage('Build') {
        runtime {
          docker {
            image = vars.rubyVersion
          }
        }

        steps {
          sh 'bundle install'
          sh 'bundle exec rake build'
        }
      }

      stage('Test') {
        runtime { from 'Build' }
        steps {
          sh 'bundle exec rake test'
        }
      }

      stage('Deploy') {
        gates { enter { vars.deploy } }
        runtime { from 'Build' }
        steps {
          sh 'bundle exec rake push'
        }
      }
    }
  }
}

This would then be later re-used within an Otto Pipeline by using the same from syntax as before:

pipeline {
  from 'blueprints/rubygem'

  /*
   * Optionally I could add additional post-deployment configuration here,
   * which would be ordered after the blueprint's stages have completed
   */
}

Since from would be somewhat context-aware, it would be able to pull all the right stages “into place” within the pipeline. I’m optimistic that this approach would also allow a blueprint definition that includes just one stage, for example, or other blocks which can be defined within the pipeline { }.

I am not yet sure what the right mechanism for passing parameters into the blueprint should be. Right now I am leaning towards keyword arguments on the from directive: from blueprint: 'blueprints/rubygem', rubyVersion: '2.6', deploy: false. I am not really sure what implementation complexity this approach will bring, however.
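Whatever the syntax ends up being, binding the supplied arguments against the blueprint's declared parameters could be as simple as the Python sketch below. The `PARAMETERS` table mirrors the parameters block of rubygem.blueprint; the `bind_parameters` function and its error behavior are assumptions for illustration.

```python
# Declared parameters, mirroring the blueprint's parameters block.
PARAMETERS = {
    "rubyVersion": {"default": "ruby", "type": str},
    "deploy": {"default": True, "type": bool},
}


def bind_parameters(declared: dict, supplied: dict) -> dict:
    """Merge supplied keyword arguments with defaults, checking names and types."""
    unknown = set(supplied) - set(declared)
    if unknown:
        raise ValueError(f"unknown blueprint parameters: {sorted(unknown)}")
    bound = {}
    for name, spec in declared.items():
        value = supplied.get(name, spec["default"])
        if not isinstance(value, spec["type"]):
            raise TypeError(f"{name} must be of type {spec['type'].__name__}")
        bound[name] = value
    return bound


# Equivalent to: from blueprint: 'blueprints/rubygem', rubyVersion: '2.6', deploy: false
variables = bind_parameters(PARAMETERS, {"rubyVersion": "2.6", "deploy": False})
```

The resulting dictionary plays the role of `vars` inside the blueprint's plan.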

Open Questions:

Will treating from almost like a preprocessor directive allow the parser to successfully handle blueprints for arbitrary blocks of pipeline?
Does this amount of composition alleviate the pressure that templates tend to solve for other systems?

Gates

The final bit of syntax I wish to discuss at the moment is “gates.” One of the least appreciated parts of just about every CD pipeline, gates define how the pipeline should behave differently under certain conditions, including pausing for user input or an external event.

From one of the modeling goals I had set:

External interactions must be model-able. Deferring control to an external system must be accounted for in a user-defined model. For example, submitting a deployment request, and then waiting for some external condition to be made to indicate that the deployment has completed and the service is now online. This should support both an evented model, wherein the external service “calls back” and a polling model, where the process waits until some external condition can be verified.

A contrived example of what this might look like for a pipeline which prepares a deployment whenever changes land in the main branch:

gates {
  enter { branch == 'main' }

  /*
  * The exit block is where external stimuli back into the system
  * should be modeled, providing some means of holding back the pipeline
  * until the condition has been met
  */
  exit {
    input 'Does staging look good to you?'
  }
}

Of anything discussed thus far, gates have the most runtime implementation requirements. In the primitive example above we have:

A Git branch being referenced, which needs to be pulled into scope somehow/somewhere.
An expression that needs to be evaluated in the service mesh before this stage of the pipeline is dispatched.
An input step which should allow the agent which executed the stage to deallocate and pause further execution of the pipeline until some external event is provided.

The last item is the most challenging for me to think about from an implementation and modeling standpoint. Somewhere within Otto a state machine for each pipeline must be maintained, and once an input, webhook, or some other step is encountered, the state machine must pause for external actions. How those external actions should be wired in? Not sure! How those steps should be defined? Not sure!
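To make the pause-and-resume problem a little more tangible, here is a toy Python sketch of a single gated stage. It answers none of the open questions; the `GatedStage` class and the `Skipped` and `Waiting` state names are invented for the example, and the enter expression is just a Python callable standing in for whatever the service mesh would evaluate.

```python
from typing import Callable, Dict, Optional


class GatedStage:
    def __init__(
        self,
        name: str,
        enter: Callable[[Dict[str, str]], bool],
        exit_prompt: Optional[str] = None,
    ) -> None:
        self.name = name
        self.enter = enter
        self.exit_prompt = exit_prompt
        self.state = "Pending"

    def start(self, scope: Dict[str, str]) -> str:
        """Evaluate the enter gate before anything is dispatched to an agent."""
        if not self.enter(scope):
            self.state = "Skipped"
        elif self.exit_prompt is not None:
            # Steps would run here; afterwards the agent can be released
            # while the pipeline waits for an external answer.
            self.state = "Waiting"
        else:
            self.state = "Success"
        return self.state

    def resume(self, approved: bool) -> str:
        """Deliver the external event (an input answer, a webhook, ...)."""
        if self.state != "Waiting":
            raise RuntimeError(f"stage {self.name} is not waiting on a gate")
        self.state = "Success" if approved else "Aborted"
        return self.state


stage = GatedStage(
    "Deploy",
    enter=lambda scope: scope.get("branch") == "main",
    exit_prompt="Does staging look good to you?",
)
first = stage.start({"branch": "main"})
second = stage.resume(approved=True)
```

The interesting part is everything this sketch skips: who holds the `Waiting` state while no agent is allocated, and how the answer finds its way back.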

There are so many open questions at this point.

Gates leave me with the most discomfort of any of my ideas for Otto. Done well, gates could provide a key component missing from many existing tools. The challenge is going to be finding the space between the pipeline modeling language and the execution engine which will accommodate them.

I still probably have more questions than answers at this point about how the pipeline modeling syntax should be defined and how it should execute. The one major lesson which I have learned from my time in the Jenkins project is that the pipeline syntax cannot be improved in isolation from the execution environment. There are many key design decisions which need to be made in both domains which will have major repercussions in the other.

I think back to the word used by a developer who read my thoughts on what I want to do with Otto:

“Ambitious.”

As always, if you’re curious to learn more, you’re welcome to join #otto on the Freenode IRC network, or follow along on GitHub

Moving again with Otto: Step Libraries

2020-10-18T00:00:00+00:00

I have finally started to come back to Otto, an experimental playground for some of my thoughts on what an improved CI/CD tool might look like. After setting the project aside for a number of months and letting ideas marinate, I wanted to share some of my preliminary thoughts on managing the trade-offs of extensibility. From my time in the Jenkins project, I can vouch for the merits of a robust extensibility model. For Otto however, I wanted to implement something that I would call “safer” or “more scalable”, from the original goals of Otto:

Extensibility must not come at the expense of system integrity. Systems which allow for administrator, or user-injected code at runtime cannot avoid system reliability and security problems. Extensibility is an important characteristic to support, but secondary to system integrity.

Usage cannot grow across an organization without user-defined extension. The operators of the system will not be able to provide for every eventual requirement from users. Some mechanism for extending or consolidating aspects of a continuous delivery process must exist.

I am starting with Jenkins and Jenkins Pipeline as a frame of reference. I do this not only because I am intimately familiar with how it works, but also because Jenkins Pipeline is the most successful and widely adopted pipeline modeling language. Key to its success are “steps.” There are a number of default steps provided by the system and new plugins introduced on the controller provide new steps for users. The “execution environment” for steps in Jenkins Pipeline is however incredibly confusing. If I were to interview a Jenkins developer or administrator, I would give them a sample Jenkinsfile and ask them to explain to me what is executing where as the pipeline progresses. In essence, steps can execute code on both the controller and the agents, hopefully with users never knowing about the quirks of the runtime dance between the two.

For Otto’s pipeline language, I wanted steps to have a perfectly clear execution environment: agent only. Along with this are a number of other requirements that I have in mind:

Language-independent: I want steps to be implemented in whatever language a developer sees fit. Therefore the tooling needs to remain flexible enough to distribute and execute Python-based steps as well as native compiled steps.
Statically verifiable: A step invocation in a pipeline should be verifiable without actually executing the step. That is to say, it should be known before execution whether parameters and types are correct.
Lowest necessary privilege: Steps shouldn’t be able to “know” anything about the system, credentials, configuration, etc, without an administrator or user being aware. If a step needs to access a shared configuration variable, it must self-declare that requirement. Steps should never be allowed to simply poke around in global variables or configuration of the environment.

The approach I’m settling on with “step libraries” is that each step is a package (.tar.gz) containing a manifest file and whatever other assets it requires to execute. The manifest file contains the description of the parameters, the entrypoint, and configuration values the step may require.
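The “statically verifiable” requirement falls out of the manifest fairly naturally. The Python sketch below checks a step invocation against a manifest without running anything. The manifest layout shown here (a `parameters` list with `name`, `type`, and `required` keys) is a guess made for illustration and not Otto's actual manifest format.

```python
# Hypothetical manifest for an 'sh' step.
MANIFEST = {
    "symbol": "sh",
    "entrypoint": "sh-step",
    "parameters": [
        {"name": "script", "type": "string", "required": True},
        {"name": "returnStatus", "type": "boolean", "required": False},
    ],
}

# Mapping from manifest type names to Python types.
TYPES = {"string": str, "boolean": bool}


def validate_invocation(manifest: dict, arguments: dict) -> list:
    """Return a list of problems; an empty list means the call is well-formed."""
    problems = []
    declared = {param["name"]: param for param in manifest["parameters"]}
    for name in arguments:
        if name not in declared:
            problems.append(f"unknown parameter: {name}")
    for name, param in declared.items():
        if name not in arguments:
            if param["required"]:
                problems.append(f"missing parameter: {name}")
        elif not isinstance(arguments[name], TYPES[param["type"]]):
            problems.append(f"{name} should be a {param['type']}")
    return problems
```

For example, `validate_invocation(MANIFEST, {"script": "pwd"})` returns an empty list, while an empty argument dictionary yields a single “missing parameter” problem.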

At runtime, the step’s entrypoint will always be invoked with a single invocation file that contains all the information necessary to execute the step correctly. For this I debated a couple different approaches: setting environment variables, piping JSON data into the process, or even having the processes request a JSON payload of data from a central server. I ultimately decided on the invocation file approach since that requires the least system knowledge for the step to actually be executed by an agent.
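The shape of that contract can be sketched in Python: write the invocation to a file, then run the entrypoint with the file's path as its only argument. JSON is used here only for convenience; the file format, the `configuration` and `parameters` keys, and the inline stand-in entrypoint are all assumptions for the example.

```python
import json
import os
import subprocess
import sys
import tempfile

# Stand-in for a step's entrypoint: it reads the invocation file named by its
# only argument and prints the 'message' parameter.
ENTRYPOINT = (
    "import json, sys\n"
    "with open(sys.argv[1]) as handle:\n"
    "    invocation = json.load(handle)\n"
    "print(invocation['parameters']['message'])\n"
)


def invoke_step(entrypoint_source: str, invocation: dict) -> str:
    """Write the invocation file, run the entrypoint with it, return its stdout."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "invocation.json")
        with open(path, "w") as handle:
            json.dump(invocation, handle)
        completed = subprocess.run(
            [sys.executable, "-c", entrypoint_source, path],
            capture_output=True,
            text=True,
            check=True,
        )
    return completed.stdout.strip()


output = invoke_step(
    ENTRYPOINT,
    {"configuration": {}, "parameters": {"message": ">> Building project"}},
)
```

The entrypoint needs to know nothing about the agent, environment variables, or any server; everything arrives through one file path.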

The role of the agent in this process remains fairly simple, regardless of which steps are being executed:

Consider the steps which it should execute. (e.g. echo, sh, junit)
Retrieve the appropriate step library artifacts. Initially this is going to be from a centralized store, but I can easily imagine an agent retrieving “remote step libraries” in a distant future.
Unpack the step libraries
Validate that the step libraries support the parameters specified by the user’s pipeline.
Iterate through the steps and execute the entrypoint.
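The agent's loop above can be sketched in a few lines of Python. To keep the example self-contained, retrieval and unpacking are replaced by a dictionary of already-loaded libraries, validation is reduced to counting parameters, and entrypoints are plain callables returning an exit status. All of that, along with halting on the first non-zero status, is assumed for illustration.

```python
def run_pipeline_steps(steps: list, libraries: dict) -> int:
    """
    steps:     list of {'symbol': str, 'parameters': list} in pipeline order
    libraries: hypothetical already-unpacked libraries, keyed by symbol, each
               {'arity': int, 'entrypoint': callable returning an exit status}
    """
    # Consider which steps are needed and make sure a library provides each.
    needed = {step["symbol"] for step in steps}
    missing = sorted(needed - set(libraries))
    if missing:
        raise LookupError(f"no step library provides: {missing}")

    # Validate every invocation before executing anything.
    for step in steps:
        expected = libraries[step["symbol"]]["arity"]
        if len(step["parameters"]) != expected:
            raise ValueError(f"{step['symbol']} takes {expected} parameter(s)")

    # Execute each entrypoint in order, stopping at the first failure.
    for step in steps:
        status = libraries[step["symbol"]]["entrypoint"](*step["parameters"])
        if status != 0:
            return status
    return 0


log = []
libraries = {
    "echo": {"arity": 1, "entrypoint": lambda message: log.append(message) or 0},
    "sh": {"arity": 1, "entrypoint": lambda script: log.append("$ " + script) or 0},
}
status = run_pipeline_steps(
    [
        {"symbol": "echo", "parameters": [">> Building project"]},
        {"symbol": "sh", "parameters": ["make all"]},
    ],
    libraries,
)
```

Because validation happens in its own pass, a pipeline with a malformed step fails before any step has run.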

In this commit I managed to get something dumb and primitive working with this model. Excusing the STEPS_DIR hack to avoid needing to reach out to fetch steps, the basic test pipeline referenced in the commit contains the essence of how I believe step libraries can provide a powerful and safe extensibility model for Otto.

There are still a number of open questions I need to answer:

How will credentials be accessed by a step in a secure manner?
How will I balance the trade-off of “bring your own step libraries” with “don’t leak credentials.” Right now I’m thinking about “trusted” versus “untrusted” step libraries, and everything user-defined would be untrusted unless added to an “allow” list by an administrator.
For more complex step parameters, like files, how well will the invocation file format hold up?
How should steps affect the flow control of a pipeline? Conventionally a non-zero exit of a step will halt the pipeline in Jenkins, but is there a more granular flow control system that can be extended to steps which are defined in a step library?

Despite sparingly little free time, I am enjoying getting back into this part of Otto. I had let myself fall into a tar pit of distributed systems problems and stalled any progress with Otto. Bringing the focus back to the pipeline model and extensibility has allowed me to re-focus on some of the challenges unique to the CI/CD space.

If you’re curious to learn more, you’re welcome to join #otto on the Freenode IRC network, or follow along on GitHub

Jenkins should not be the only line of defense

2019-04-15T00:00:00+00:00

This past week a missed security update contributed to a compromise at Matrix.org. As I have said before, for purposes of infrastructure design, it is prudent to consider CI/CD tools like Jenkins as “remote code execution as a service.” In the Continuous Delivery world, I think we have a serious problem with user education around securely running CI/CD tools; anything which can touch production represents a potential liability.

While these thoughts were bubbling around in my mind, I saw this tweet from Mrinal Mukherjee:

Many organisations tend to have separate non-production and production instances of a deployment orchestrator (@jenkinsci) to manage non-production and production deployments respectively. This, as opposed to a single instance which handles both use-cases. Thoughts?

In this post, I wanted to expand on my response:

I tend to prefer two systems because it is rather difficult to totally and completely secure credentials for production systems, when you give developers “Pipeline as code” :)

The “production” instance of Jenkins would typically just handle the last mile of delivery.

The unending trade-off infrastructure and tools developers must make is one of flexibility versus reliability. While it would be nice to live in a world where our automated systems allow code from individuals to fail in ways which do not adversely impact customers, for the most part we have to draw the line in the sand somewhere, whether that is by restricting access to networks, reducing the scopes of credentials, or segmenting systems entirely. I do not view this as a problem, but a realistic approach to systems of safety.

My approach to this when structuring Jenkins infrastructure is to segment along “non-production” and “production” systems. The non-production system has non-production credentials, which have a low consequence if disclosed or misused by developers who author a Jenkinsfile. The production system however maintains production credentials, which are scoped to specific Folders or Pipelines in Jenkins. It only processes code deemed fit for production, such as that in the master branch, and never pull requests.

If you step back from Jenkins itself and consider an application which stores highly valuable secrets, what would your defense in depth strategy look like? Running any app on a hostile network requires this kind of thinking. A critical credential or bit of data living in an application which is a single bug away from being exposed is simply bad design. We take this approach seriously in the Jenkins project, because we run a Jenkins environment on a hostile network, also known as “the internet.”

In our case, there are Jenkins environments on the public internet, but the Jenkins environments which hold deployment or production credentials are simply unroutable on the public internet. By requiring a jump host or a VPN to access the environment, it is simply impossible for an attacker who might be scanning a cloud provider’s address space to find and compromise the environment. There are certainly other problematic avenues, but that’s where the “defense in depth” comes in again. I’ve written some more tips on managing credentials in Jenkins specifically in a previous blog post: It’s not stealing when you’re giving them away. One of my favorite approaches is using tools like Hashicorp Vault which can generate secrets dynamically, making the leakage of credentials less impactful.

Regardless, it is absolutely critical to put services which have production credentials, or keys which can lead to secondary levels of compromise, behind VPNs or other encrypted gateways. The public internet is a scary place, and if you launch a Jenkins instance into AWS, Google Cloud, or Azure, I guarantee it will be scanned within 10-15 minutes by script kiddies.

CI/CD tools represent an ideal attack vector not only for credentials, but for other supply-chain attacks that could further compromise your end users. Designing a layered and secure approach to running any CI/CD tool is incredibly important for everybody shipping software today. But generally, please don’t let any single application be the sole line of defense between credentials or user data, and the goblins running around on public networks.

Securely running Docker workloads in your CI/CD environment

2019-02-14T00:00:00+00:00

Over the past few years, the topic of architecture and security for CI/CD environments has become among my favorite things to discuss with Jenkins users and administrators. While security is an important consideration to include in the design of any application architecture, with an automation server like Jenkins, security is crucial in a much more fundamental way than a traditional CRUD app. Walking that fine line between enabling arbitrary use-cases from developers and preserving the integrity of the system is a particularly acute problem for CI/CD servers like Jenkins.

In one of my previous “old man yells at cloud” posts I concluded with:

People sometimes joke that Jenkins is “cron with a web UI”, but I will typically refer to it as “remote code execution as a service.” A statement which garners some uncomfortable laughs. If you’re not thinking of CI/CD systems like Jenkins, GoCD, Bamboo, GitLab, or buildbot as such, you might be sticking your head in the proverbial sand, and not adequately addressing some important security ramifications of the tool.

In this post I would like to outline some of the architectural and security-oriented decisions I made for Docker-based workloads when rebuilding ci.jenkins.io, the Jenkins project’s own Jenkins environment, in 2016.

Requirements

For the vast majority of users, I think a Jenkins environment that doesn’t support Docker is a glaring omission. Supporting container-based workloads in a CI/CD environment, even if a production environment does not utilize Docker, allows such a tremendous amount of flexibility for developers to own their build and test environment.

The Docker horse has been beaten to death at this point; I don’t have much interest in convincing people to adopt it, any more than I have a desire to convince people to adopt writing tests, use source control, or any other sensible development practices circa 2018.

Within the Jenkins project, our CI infrastructure requirements were/are loosely:

Must be able to support elastic workloads to handle the periodic “thundering herds” of re-testing Pull Requests. Some repositories, such as the git plugin have a number of outstanding Pull Requests which must be re-tested when commits are merged to the master branch, in order to ensure the commit status (green checkmark) is still valid, and master is always passing tests. In practice this means that a single merged Pull Request could create upwards of 50 Pipeline Runs at once.
Should reduce, or eliminate, the potential for Pipeline Runs to contaminate each other’s workspaces, or adversely affect the Docker environment for a subsequent Pipeline Run using that daemon.
Must allow developers to specify their own execution environment, in effect, a developer must be able to “bring their own container” without prior approval by an administrator.
Potential “container escapes” must not seriously impact the performance, security, or stability of other parts of the environment. While these are rare, they do happen, as was the case with this year’s CVE-2019-5736.

I don’t believe these requirements are unique to the Jenkins project; rather, they are general purpose requirements for any sizable organization. That is to say, once a team or organization grows past the phase of “everybody is admin” trust, these requirements likely apply.

For purposes of discussion, imagine the following Pipeline is our typical workload, one which specifies its Docker environment, and then runs scripts inside of that environment.

pipeline {
    agent { docker 'maven:3' }
    stages {
        stage('Build') {
            steps { sh 'mvn' }
        }
    }
}

Options

The learning curve around the options in the container ecosystem can be quite steep. There are a plethora of options, and not all of them are safe, secure, or reliable for “untrusted” workload requirements. The inventory in this post is not comprehensive but rather a listing of options which I have personally evaluated.

Docker: the easy, but not the smartest way

The most common pattern I have seen from Jenkins users in the wild has been to use the Docker daemon on the Jenkins master instance to run their workloads. For untrusted workloads this is a bad idea. Setting aside the potential performance impacts of running workloads on the same machine as the Jenkins master, let’s focus on the security aspect.

Jenkins stores all of its configuration, logs, and secrets on disk, usually in /var/lib/jenkins. While secrets are encrypted on disk, the key for decrypting those secrets is stored elsewhere on the same file system. In essence, this means that once an untrusted user has access to the Jenkins master’s file system, it’s as good as compromised.

When the Docker daemon (dockerd) runs, it is effectively running as root. If a user can launch a Docker container, that is functionally equivalent to granting them root access to the machine. I do not consider this a bug in Docker, however; replicating the entire access control subsystem from Linux in dockerd would be impractical.

Plainly put, it is not safe to allow untrusted workloads, Docker or otherwise, to execute on the Jenkins master’s instance. We regularly advise people to set the number of executors for the master node to zero to help avoid this security pothole.
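
As a rough sketch, the executor setting can also be applied from the Jenkins script console instead of the web UI. This assumes the script runs with administrative rights inside Jenkins; it is an illustration of the advice above, not the exact script used on ci.jenkins.io.

```groovy
import jenkins.model.Jenkins

// Sketch: stop the master node from accepting any workloads.
// Assumes this runs inside Jenkins (e.g. the script console) as an administrator.
Jenkins jenkins = Jenkins.instance
if (jenkins.numExecutors != 0) {
    jenkins.setNumExecutors(0)
    jenkins.save()
}
```

With zero executors on the master, every Pipeline must be scheduled onto an agent, which keeps user-defined workloads away from the file system holding Jenkins’s secrets.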

It is possible to configure Docker-based agents in Jenkins, which run on the master, but are not user-defined in Jenkins Pipeline. These can be safer, but are still susceptible to container escape vulnerabilities, and will result in performance problems as workloads and the Jenkins master compete for memory and compute time.

Docker Swarm

Another option considered was using a scalable Docker Swarm cluster for running the untrusted workload containers. What is interesting about Docker Swarm is that it can be relatively easy to enable on a cluster of machines which have the Docker engine installed. At the time when our environment was built out, however, it was not mature enough for me to trust it. In addition, it didn’t quite match our infrastructure model. At no point have we had latent capacity waiting to be enabled, but rather we have had a strongly managed environment between Puppet and Jenkins.

Docker Swarm, and Kubernetes for that matter, both have a usability flaw in Jenkins Pipeline. In order to use the agent { docker 'maven:3' } syntax, Jenkins needs to be able to execute docker run somewhere. But that somewhere must already be running a Jenkins agent. Unfortunately Jenkins is not smart enough at the moment to see that the Pipeline wants to run an image and use a configured orchestration engine, without the user needing to consider what Docker-in-Docker, or JNLP agent hacks might be necessary. This problem gets even more hairy if your workloads need to build Docker containers at any point. This topic is one I have devoted substantial effort to, and am happy to discuss separately, but suffice it to say for this blog post: the integration between Jenkins and orchestrators is suboptimal at best right now.

Kubernetes

Kubernetes is another option considered at the time. The Jenkins project currently runs a non-trivial amount of infrastructure in Kubernetes in production, and I am quite pleased with it. I still do not believe it is the appropriate basis for a CI infrastructure like ours, wherein we must run untrusted workloads.

First considering the performance: Kubernetes itself is relatively low-overhead, but tends to operate with a fixed-size cluster. While there are options in some public clouds to auto-scale Kubernetes, I don’t frequently see that enabled. From my experience, CI workloads are incredibly compute heavy. At one point our cloud provider contacted us to let us know that they believed some of our dynamically provisioned VMs were compromised by cryptominers. From their perspective, the behavior of a high-intensity Jenkins build looked similar to cryptomining! The manner in which Kubernetes schedules containers works very well for mixed types of workloads: packing one compute-heavy container on a node with other containers which do not have the same requirements allows for an ideal and efficient use of compute resources. When everything will heavily utilize one dimension, such as the CPU of the underlying computer, the benefits of Kubernetes’ resource allocation dwindle.

From the security standpoint, I believe Kubernetes can be used safely for CI workloads. The mistake that I most frequently see is mixing the “management plane” (Jenkins master) with user-defined workloads (agent pods). Running both on the same Kubernetes infrastructure is a fundamental failure of isolation and will result in compromise. Any eventual bypass may allow access to the underlying Kubernetes API; from there it would be trivial to schedule new workloads, or attach the persistent volume from the Jenkins master. I do not consider this to be a theoretical problem, as my understanding of Kubernetes is that it was never designed to be a multi-tenant orchestrator. Jess Frazelle has an interesting design for one however!

Another security wrinkle arises if the cluster needs to support building of Docker containers. To the best of my knowledge, this requires either Docker-in-Docker hacks, or more commonly, pass-through access to the Kubernetes node’s Docker socket. Once that socket has been passed through from the node to an untrusted container, it’s a relatively trivial exercise to use that socket to access and peek at any other workload on that specific Kubernetes node. As alluded to above in the case of using Docker on the Jenkins master: never allow untrusted workloads access to a trusted Docker socket.

This is not to say that you should never use Kubernetes with Jenkins. For internal deployments, with different threat models and trust characteristics, Jenkins and Kubernetes can work quite successfully together. As is usually the case with security and infrastructure design, the devil is in the details.

Actually Docker-in-Docker

Running Docker inside of Docker on top of the orchestrators described above was something I considered as well. At the time of the design of ci.jenkins.io, the stability of Docker-in-Docker approaches was highly questionable. This may be different nowadays, and might be worth reconsidering for newer system designs.

Docker: the hard, but perhaps the most reliable way

The design that I ultimately chose, which is still in place today, I think of as “Docker the hard way.” Jenkins dynamically provisions fresh VMs in the cloud, installs Docker on them, and then launches its agent. This has numerous benefits from a security, isolation, and performance standpoint. Workloads get dedicated high-performance compute capacity, and if any of those workloads tries to do something nefarious, the impact is isolated to that single machine which is usually deprovisioned shortly after the workload has finished executing.
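
From the developer’s side, this design changes very little. A minimal sketch, assuming the dynamically provisioned VM agents carry a label named 'docker' (the label name is an assumption, not something stated above), pins the container from the earlier example workload to those machines:

```groovy
pipeline {
    agent {
        docker {
            image 'maven:3'
            // 'docker' is an assumed label for the freshly provisioned VM agents
            label 'docker'
        }
    }
    stages {
        stage('Build') {
            steps { sh 'mvn' }
        }
    }
}
```

The docker run invocation then happens on a short-lived VM rather than on the master or a shared daemon, so whatever the container does stays confined to that machine.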

This isolation does come at a cost however. The time-to-available can be multiple minutes, meaning the cluster cannot rapidly grow when that “thundering herd” problem occurs. The actual infrastructure cost is also non-trivial. Our Jenkins infrastructure is the most costly part of our infrastructure right now. While anything “big and beefy” is going to be expensive in the cloud, the time-overhead to request, provision, and de-provision has a real financial impact.

Running untrusted workloads in a CI environment is not a requirement isolated to large environments like the Jenkins project. Most organizations really should treat their CI environment as if it were “untrusted”, not because there are malicious actors internally, but because the same design considerations which minimize the impact of malice also have the beneficial effect of preventing errors or incompetence from destabilizing the CI system. If a new developer in the organization can accidentally brick the CI/CD environment, that will most certainly be disruptive and costly for the org.

There are other concerns which are not accounted for in this post, which I would like to make special mention of as they’re worth considering:

Runaway resource utilization: presently in Jenkins it is rather difficult to globally restrict how much time, or resource, a Jenkins Pipeline is able to allocate. We have strived to make it easy for developers to do the right thing, but must remain vigilant, keeping an eye out for Pipelines which have locked up or are stuck in infinite loops. While rare, these still can tie up resources, and time is money when operating in the public cloud!
Secrets management with Pipelines: inevitably some Pipelines will need an API token, or credential in order to access or push to a given system. Jenkins has some support for separating credentials but the audit and access control functionality is currently lacking, making it difficult to delegate trust in a mixed trust environment. An easy workaround is to put trusted credentials in another Jenkins environment, which is exactly what we do in the Jenkins project, but is a worthy subject of another post entirely.
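
On the first of these concerns, what Jenkins does offer is a per-Pipeline opt-in. The sketch below uses the Declarative options directive to abort a run after an hour; the one hour limit is an arbitrary example, and nothing here forces other Pipelines to declare the same thing, which is exactly the gap described above.

```groovy
pipeline {
    agent { docker 'maven:3' }
    options {
        // Abort the whole run if it exceeds one hour (example value)
        timeout(time: 1, unit: 'HOURS')
    }
    stages {
        stage('Build') {
            steps { sh 'mvn' }
        }
    }
}
```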

Future iterations on our environment will likely incorporate a mixture of VMs and container services to balance speed and security more effectively. Not all workloads need Docker, some just need Maven, Node, etc. More efficiently balancing the disparate requirements of the hundreds of Jenkins project repositories which rely on ci.jenkins.io is slated for “version 2” of this infrastructure. :)

Overall, at this point I would consider using containers in any CI/CD environment an absolute must. The challenge for system administrators, as it usually ends up, is balancing cost, security, and flexibility for users.

Get excited for the Continuous Delivery Foundation

2019-01-31T00:00:00+00:00

Not knowing what I was getting myself into, about eleven years ago I started contributing to what became known as the Jenkins project. What followed has been nothing short of incredible; hundreds of new contributors, tens of thousands of new users, and millions of executed pipelines. Growth is challenging. Growth means new problems which demand new solutions. Two and a half years ago I stood in front of a large group of contributors at the 2017 Jenkins World Contributor Summit and made a pitch for what I called a “Jenkins Software Foundation”, never shy to pilfer ideas from the Python community. With help from my pal Chris Aniszczyk and the Linux Foundation, the concept morphed into something far more comprehensive: the Continuous Delivery Foundation (CDF), for which my colleague Tracy Miranda has been leading the charge, helping drive the founding of the CDF.

Kohsuke wrote up a good overview post for the jenkinsci-dev@ mailing list which spells out the reasons why the Jenkins project should join the Continuous Delivery Foundation once it has been established. For those interested in the Jenkins project, I encourage you to take the time to read Kohsuke’s mail if you have not already. In this post, I wanted to share some of the reasons that I am excited to help establish the Continuous Delivery Foundation (CDF).

Continuous Delivery (CD) has been an integral part of my career, something which I learned early and became passionate about, even before it was so clearly characterized by Jez Humble. I view it to be so fundamental to the practice of software development, that I have started to react like a puzzled puppy when somebody says they don’t practice CI or CD. Imagine if somebody said “eh, we’ve got a project to adopt Source Control here, but the executives aren’t really convinced yet.” Your eye would twitch and your jaw would drop. “How can any organization not use Source Control in this day and age?!” I believe CD is that fundamental to modern software development.

Continuous Delivery is also not the domain of a single tool like Jenkins, but rather relies on many tools working together in concert. While I might put Jenkins at the center of it all, it is by no means the only pretty face in the picture. Unfortunately, many open source communities like Jenkins tend to have a necessarily narrower view of their world. They focus on their thing, which makes sense, but this can result in missed opportunities for incredibly valuable cross-over episodes.

Many of the tools we rely on for CD are supported wholly, or in part by different vendors as well. Jenkins receives substantial investment from CloudBees, as well as Microsoft and Red Hat to name a few. In the last five years, I have come to understand how and why foundations such as the CDF, can act as neutral territory for these different companies. By providing corporate contributors a set of guidelines, rules, and expectations, open source projects stand a much greater chance of eliciting support from them. Whether it’s advocacy, code, or cash, helping bring corporate contributors under the same neutral tent as the rest of us helps ensure the longevity of open source efforts. The added benefit of the rules set forth by the foundation is that corporate actors cannot overrun one another or individual contributors, intentionally or otherwise.

In the earlier days of free and open source projects, we deluded ourselves into thinking that everybody would read our licenses, subscribe to our “open source ethos”, file and fix issues, and contribute code back upstream. The reality is that it takes a lot more to operate large open source communities. It takes people, it takes infrastructure, and it takes money. Foundations like the CDF provide a means for organizations which depend on, or are otherwise invested in projects, to participate in a meaningful way. The Jenkins project runs on a shoe-string budget. We spend no more than $10-15k annually. If we were to tabulate the value of our donated assets, free services, or any of the other things I have managed to beg for over the past eleven years, that number would be closer to $60-80k annually. Kohsuke can attest to my ability to beg for free stuff for the Jenkins project, but free stuff is not guaranteed year to year. In order to grow, Jenkins needs a stable budget which we can invest in services and people, similar to larger foundations like the FreeBSD Foundation.

If you find yourself worried about the sustainability of open source, looking at different community homes, crowd-funding, or other ideological tools such as licensing changes, let me help you out. What makes large open source projects sustainable is a consistent budget. Because underneath it all, what makes open source projects “go” is people. Ensuring talented writers, developers, marketers, testers, and designers continue to contribute means that their employers have to invest time on their behalf, or they need to be paid through other means. I strongly believe that open source foundations provide a path for larger free and open source projects to solve that fundamental problem of budget.

The Continuous Delivery Foundation is not yet launched, but I’m already excited for its potential. Not only for the Jenkins project, but for the entire domain of continuous delivery.

It’s about time.

Crawling towards continuous delivery for Jenkins

2018-08-30T00:00:00+00:00

This year I’ve been working on an ambitious new project referred to as Jenkins Evergreen. It is ambitious in that we’re aiming to significantly alter the way in which Jenkins is downloaded, updated, and used. In most visible ways Evergreen is the same as a traditional Jenkins installation, but the way it is assembled into a package and delivered is radically different. Among the many challenges which the Evergreen project must tackle, there is one problem in common with most other organizations: how do you take a big, complex system, and make it continuously deliverable?

Long story short: very carefully.

The Old Way

Jenkins follows a pretty typical development and release process: we chat on mailing lists, open up loads of pull requests, merge some of them, and then release binary packages at prescribed intervals. Users are then expected to know an update has occurred, run some program to check for updates (apt-get update) and install the updates. From the user’s perspective, each release might contain a lot of relevant or important changes, or it might contain completely trivial ones. Depending on the release line, a bug identified and fixed may take anywhere from one to a few weeks before it’s made available. Then of course, the user must go through the update song and dance once more. This common release train model sucks for users and, I would argue, for developers too.

Jenkins has an additional complication: it is plugin-based, and all those plugins are developed and released largely independently from one another. Jenkins “core” by itself isn’t very useful at all. It is those plugins which make Jenkins a joy (and sometimes a pain) to use. For all intents and purposes, plugins also follow the release train model (which sucks for users), but with the bonus feature of requiring users to check for updates through the built-in Update Center rather than through the same distribution mechanism as Jenkins core.

Altogether, this leads to large numbers of Jenkins users never updating their systems.

Nonetheless, the release train model has helped Jenkins grow to where it is today. The times, however, have changed.

Expectations around how we consume and operate our software have changed radically in the past decade. It is my steadfast opinion that the release train model is now a legacy which we should all be leaving behind.

The New Way

The model for Jenkins Evergreen is completely different. Rather than a time-based pull model (also known as the release train model), it provides an on-demand push model. As I described in the design overview document, JEP-300:

Jenkins Evergreen will be distributed as an automatically self-updating distribution, containing Jenkins core and a version-locked set of plugins considered “essential.” Rather than attempting to mirror the existing Weekly and LTS release lines for core, plus some plugin version matrix, Jenkins Evergreen will update in a manner similar to Google Chrome.

For Jenkins end users, this automatically updating distribution will mean that Jenkins Evergreen will require significantly less overhead to manage, receiving improvements and bug fixes without any user involvement.

Fundamentally, Jenkins Evergreen is about building the machinery to practice Continuous Delivery with Jenkins itself. The argument for Continuous Delivery is that smaller releases are safer than big-bang releases. Risk is amortized, and the tooling and habits of releasing often result in higher-quality software.

Jenkins needs Continuous Delivery.

How on earth do we get from the release train model (which sucks for users), to something more continuously delivered?

Very carefully!

Like most transitions to continuous delivery, Jenkins Evergreen requires a significant amount of ground work in our existing code bases before new code adopts the Evergreen distribution model.

Incremental Releases

My colleague Jesse wrote a pretty in-depth article on a new pattern we’ve introduced into the Jenkins project, generally referred to as incremental releases. Jenkins core and plugins are all Java projects which have rich Maven metadata describing their interdependencies.

In the release train model the velocity of changes, and version bumps, required for any given plugin will be fairly minimal. In the release train model, it is okay to create a pull request to Plugin B, wait for that to be released, then update Plugin A to depend on that change, and then wait for that to be released. In the release train model it is okay to wait for weeks on end before users see the effects of changes.

In 2018 however, that long cycle time is not okay.

Incremental releases allow for plugins to produce artifacts built from pull requests, or branches, and for those artifacts to be published to a special incrementals Maven repository. From that repository, incremental releases of artifacts can be subsequently consumed by other tooling.

In the case of Jenkins Evergreen, this allows us to craft a distribution with changes that are hot off the presses, using another foundational component: the Bill of Materials.

If you’re curious about the design of incremental releases, consult JEP-305 which outlines their design.

The Bill of Materials

Curation is a key component of any continuous delivery system. We do not necessarily want any old commit to be released all the way through to “production.” Instead we want a means to describe what versions of which components are safe to proceed through the pipeline.

As described in JEP-309, the Bill of Materials gives us a means of describing a combination of Jenkins core and plugins, which should be delivered together. This specification is currently being used by multiple parts of the Jenkins project where we have a similar need to test across multiple components and repositories. In Evergreen it is taken much further.

The Bill of Materials describes what code will be delivered to a Jenkins Evergreen instance, and the Evergreen distribution system will attempt to ensure that all instances are at the same exact version of that Bill of Materials. The Evergreen distribution system treats all instances as if they were part of a single fleet, similar to how SaaS applications are deployed.
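
To make the “same exact version” idea concrete, here is a minimal sketch of comparing what an instance has installed against a Bill of Materials. The map structure is a hypothetical simplification (the real JEP-309 format carries more detail), the version numbers are invented, and pendingUpdates is an illustrative helper rather than anything from the Evergreen code base.

```groovy
// Hypothetical, simplified Bill of Materials; versions are invented for illustration.
def bom = [
    core   : '2.0',
    plugins: ['git': '1.1', 'workflow-aggregator': '2.0']
]

// What a hypothetical instance currently reports as installed.
def installed = [
    core   : '2.0',
    plugins: ['git': '1.0']
]

/**
 * Return the components whose installed version differs from the BOM
 * (or which are missing), mapped to the version the BOM asks for.
 */
Map<String, String> pendingUpdates(Map bom, Map installed) {
    Map<String, String> pending = [:]
    if (bom.core != installed.core) {
        pending['core'] = bom.core
    }
    bom.plugins.each { name, version ->
        if (installed.plugins[name] != version) {
            pending[name] = version
        }
    }
    return pending
}

println pendingUpdates(bom, installed)
// prints [git:1.1, workflow-aggregator:2.0]
```

An instance with an empty result is already at the fleet’s version; anything else is the set of changes the distribution system would need to push.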

This homogeneity addresses a fundamental problem with plugin-based ecosystems like Jenkins’s: an explosion of possible installed combinations of software across all user installations. The large variety of plugin combinations possible “in the wild” makes bug reporting and reproduction difficult, and serious pre-release acceptance testing practically impossible. In many cases, the first time certain combinations will ever be executed together will be on the user’s installation.

Feedback

The final logical piece of the puzzle which any continuous delivery pipeline requires is feedback. Much as it pains me to say this, current releases of Jenkins provide no automated feedback to the Jenkins project on whether they are operating successfully. No automated crash reports. No error logs. No analytics. Nothing. The only two ways that a Jenkins contributor will ever learn about a bug in their plugin or core is if:

They see it themselves.
A user actually takes the time to manually report it.

Regrettably, this is also the case for tons of free and open source software, and it’s an absolute shame.

With Jenkins Evergreen, basic error reporting is built in by default. We have integrated with Sentry for collecting errors automatically from Jenkins Evergreen installations without any required user involvement. In the future I’m sure we’ll add more advanced feedback mechanisms, but at the moment a blurry picture of how Jenkins is running “in the real world” is tons better than flying blind.

Jenkins, like any large piece of software which has grown over a long period of time, has its flaws. After a couple beers, I could tell you about some of the skeletons in its closet, but on the whole I don’t believe Jenkins is inherently broken, or a lost cause. In fact, I believe that Jenkins is likely now more important than ever. With the practices of continuous integration and continuous delivery becoming a core part of every software project, a flexible and customizable open source tool like Jenkins is increasingly important.

Jenkins Evergreen is my vision of how we get to a better future with Jenkins. By continuously delivering Jenkins, I believe we will be able to improve the user experience, alleviate troublesome bugs, and make Jenkins even more accessible to new developers.

Enforcing administrative policy in Jenkins, the hard way

2018-01-05T00:00:00+00:00

One foggy morning a few weeks ago, I received a disk usage alert courtesy of the Jenkins project’s infrastructure on-call rotation. In every infrastructure ever, disk usage alerts seem to be the most common alert to crop up: something, somewhere, is not properly cleaning up after itself. This time, the alert was from our own Jenkins environment. The logging filesystem wasn’t the problem; the filesystem hosting JENKINS_HOME was perilously close to running out of space. The local time was about 6:20 in the morning, and yours truly was quietly furious at the back of a bus headed into San Francisco for the day.

To put it delicately, Jenkins has always been a pain for Systems Administrators. What was originally a huge selling point, the WYSIWYG configuration screens, over time, and thanks to the healthy adoption of “infrastructure as code” tooling such as Puppet, has become a weakness. With the introduction of “Pipeline as Code” as a core concept in Jenkins 2, circa 2016, the problem was even further exacerbated. Empowering developers with some level of code-driven autonomy is now a key aspect of any modern development tool, but without corresponding tooling and controls for administrators, such autonomy rapidly leads to chaos.

Back on the bus ride, the usage of JENKINS_HOME slowly inched towards 100%. A quick analysis indicated that most of the disk space was being occupied by what any capable Jenkins admin would expect:

Old archived artifacts.
Old test reports.
Old console logs.

With Jenkins Pipeline, developers have control, to the detriment of administrators like me, who have no (simple) means to systematically enforce things like log rotation.

That doesn’t mean administrators are left entirely out in the cold, but rather we have to enforce administrative policy the hard way.
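
For context, log rotation in Pipeline is something each developer opts into from their own Jenkinsfile. A minimal sketch, with the retention count and the build command chosen only for illustration:

```groovy
pipeline {
    agent any
    options {
        // Keep only the five most recent runs (example value)
        buildDiscarder(logRotator(numToKeepStr: '5'))
    }
    stages {
        stage('Build') {
            steps { sh 'make' }
        }
    }
}
```

This works well when it is present, but nothing obliges any given repository to include it, which is why the administrator ends up reaching for scripts.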

Scripting Jenkins

Jenkins has support for built-in Groovy scripting, which is the usual solution for enforcing administrative policy in Jenkins. In order to rectify the disk usage situation, I wrote a little snippet of Groovy which will forcefully purge all but the last 5 runs of every Pipeline in the “Plugins” folder on the system:

Jenkins.instance.items.each { f ->
    if (f.name == 'Plugins') {
        f.items.each { p ->
            /* each p is really a Multibranch Pipeline, which looks like a
             * folder, so we need to iterate over its items */
            p.items.each { pipeline ->
                if (pipeline.builds.size() > 5) {
                    println "Deleting from ${p}"
                    /* Delete runs older than the last five */
                    pipeline.builds[5 .. -1].each { it.delete() }
                }
            }
        }
    }
}

Scary! Right now I have only added this little Groovy script to the infrastructure team’s runbooks. If I wanted to enforce this more systematically, I would add a file to the init.groovy.d/ directory on the Jenkins master.

init.groovy.d

Many administrators aren’t aware of the init.groovy.d/ directory, which can be added to JENKINS_HOME. The really really useful characteristic of Groovy scripts added to init.groovy.d/ is that they are executed after Jenkins plugins are loaded, but before Jenkins is “ready” and starts accepting web requests or executing workloads. These qualities make init.groovy.d/ an ideal place to insert scripts which:

Clean up the filesystem, such as with my forceful log rotation script referenced above.
Enforce security policy, like my Groovy scripts which disable the Jenkins CLI, or configure GitHub OAuth-based authentication and authorization.
Configure monitoring tooling, such as the Datadog plugin.
Pre-configure Pipeline Libraries, like those which should be enabled globally for all Pipelines.

As I mentioned in my previous post Developing Groovy Scripts to Automate Jenkins, creating these scripts requires a lot of knowledge about how Jenkins works on the inside. While this is definitely “the hard way,” the end result is a much more automated and manageable Jenkins environment.

To learn more about scripting Jenkins, I highly recommend the talk embedded below, given by my pal Sam Gleske at Jenkins World 2017.

Scripting Pipeline

In my previous post Overriding steps in Pipeline with Shared Library sleight of hand, I discussed another option for enforcing administrative policy: overriding Pipeline steps. While I won’t repeat too much, I do wish to point out a very useful pattern to consider: enforcing timeouts on built-in steps. Take the sh step as an example: by default in Jenkins there is no built-in way, configurable or otherwise, to constrain the time spent by a step. This means a malicious or incompetent developer can run a script which performs an infinite loop, wastefully tying up resources in the Jenkins environment.

By overriding the sh step, I can wrap it with a 2 hour timeout safe-guard as is implemented below. Once the Shared Library has been implicitly loaded in the Global Pipeline Libraries configuration, developers won’t notice any changes, but the beleaguered administrator will sleep a bit easier at night.

def call(Map params = [:]) {
    String script = params.script
    Boolean returnStatus = params.get('returnStatus', false)
    Boolean returnStdout = params.get('returnStdout', false)
    String encoding = params.get('encoding', null)

    timeout(time: 2, unit: 'HOURS') {
        /* invoke the built-in sh step */
        return steps.sh(script: script,
                    returnStatus: returnStatus,
                    returnStdout: returnStdout,
                        encoding: encoding)
    }
}
/* Convenience overload */
def call(String script) {
    return call(script: script)
}

An easier way?

Work is currently being undertaken, spear-headed by Ewelina Wilkosz at Praqma under JEP-201 titled “Configuration as Code.”

We want to introduce a simple way to define Jenkins configuration from a declarative document that would be accessible even to newcomers. Such a document should replicate the web UI user experience so the resulting structure looks natural to end user. Jenkins components have to be identified by convention or user-friendly names rather than by actual implementation class name.

While I haven’t had the time to really dive deeper into what Ewelina and her crew are proposing, they are certainly in the right ballpark for making Jenkins easier to administer, and policies easier to enforce.

Once you come to terms with scripting Jenkins, there are a number of ways in which policy can be enforced using those scripts. My current preferred method is to use init.groovy.d/, but those only apply during boot/restarts. It’s also possible to execute those very same scripts via the Jenkins CLI, which I have done in the past. Through a clever combination of shell, Groovy, and Puppet scripting, it’s possible to write idempotent scripts which Puppet can run every time the Puppet Agent runs, ensuring on-going compliance.
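
The idempotency mentioned here boils down to comparing current state with desired state and changing only what differs. The sketch below shows that pattern in plain Groovy; the settings map stands in for live Jenkins configuration, and the enforce helper and its keys are hypothetical names for illustration.

```groovy
/**
 * Apply desired settings only where they differ from the current ones.
 * Returns the keys that were changed, so a caller can tell whether
 * anything actually happened on this run.
 */
List<String> enforce(Map current, Map desired) {
    List<String> changed = []
    desired.each { key, value ->
        if (current[key] != value) {
            current[key] = value
            changed << key
        }
    }
    return changed
}

// Stand-in for live configuration; these keys are hypothetical.
def settings = [numExecutors: 2, quietPeriod: 5]
def policy   = [numExecutors: 0]

assert enforce(settings, policy) == ['numExecutors']  // first run changes something
assert enforce(settings, policy) == []                // second run is a no-op
```

A script shaped like this can be run on every Puppet Agent run without side effects, since repeated runs against a compliant system change nothing.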

Just because it isn’t easy, doesn’t mean it’s impossible.