Resiliency, High Availability, and Disaster Recovery

This page provides guidelines on Orka's resiliency, high availability (HA), and disaster recovery (DR) capabilities.

About

Orka provides orchestration and virtualization for macOS runtime environments. It is particularly well suited to continuous integration (CI) build and test processes.

Overview

This document outlines the built-in Resiliency, High Availability (HA), and Disaster Recovery (DR) capabilities of Orka Cluster components when hosted on MacStadium. The document also outlines additional solutions, which can be implemented with added complexity and cost to achieve more advanced setups.

It is important to note that HA/DR setups can vary significantly in complexity and cost, based on individual needs.

Orka Cluster provides substantial resiliency as part of its core offering, which is achieved by balancing effectiveness, sound expectations, and shared responsibilities between MacStadium and our customers.

Key Components

When considering resiliency, HA, and DR for Orka, we focus on three main software components of the system:

Control Plane that orchestrates the workloads
Images that can be instantiated as VMs
VM Runtimes that are executing the workloads

See the following diagram of these components.

[Figure: Orka Cluster 3 architecture diagram showing the control plane, OCI registry, and bare metal Mac host components]

Each component has a different approach and set of requirements to support the availability and recoverability of the Orka environment.

Control Plane

The Control Plane is built on Kubernetes and is responsible for managing and distributing workloads across the system.

Resiliency/HA Capabilities

When hosted by MacStadium, Orka Cluster employs an active-active control plane architecture, utilizing three VM hosts located in a single site*. If a node goes down, another node takes over while the failed node restarts.
* Active-active control plane is enabled with Orka Cluster Advanced.
This setup provides a level of resiliency and High Availability (HA) if one or two control plane nodes fail, so the system can continue orchestrating workloads without interruption.

DR Capabilities

MacStadium takes regular backups of the control plane configuration, which are stored on the master nodes' NFS mount.
In the unlikely event of a complete control plane loss, our Disaster Recovery (DR) plans include the ability to redeploy the control plane based on the most recent backup.

Images

Images are the saved state of a VM on disk that can be used to run VMs.

Resiliency/HA Capabilities

When hosted by MacStadium, Orka Cluster stores images in a Pure Storage array, which is known for its high level of resiliency. RAID capabilities provide protection against drive failures.
If you prefer more control or have existing storage solutions, Orka Cluster supports image storage on any OCI-compliant repository. You can implement High Availability (HA) solutions using the capabilities provided by the repository.

DR Capabilities

The Pure Storage arrays can take backups at scheduled intervals and store the data in adjacent storage at additional cost.
If you run an OCI-compliant repository, you can implement DR solutions using the capabilities provided by the repository.
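
As one illustration of a registry-level DR approach, the sketch below compares the contents of a primary registry with a secondary (DR) registry and lists the images that still need to be copied. It is a minimal sketch under stated assumptions: the function name and the image references are hypothetical, and each registry is represented as a plain mapping from image reference to content digest, which you would obtain from your own registry tooling. It does not call any Orka or registry API.

```python
from typing import Dict, List


def images_to_replicate(primary: Dict[str, str], secondary: Dict[str, str]) -> List[str]:
    """Return image references that are missing or out of date in the secondary registry.

    Both arguments map an image reference (for example "ci/sonoma-base:1.2")
    to its content digest. These mappings are assumptions for illustration;
    how they are collected depends on the registry in use.
    """
    return sorted(
        ref
        for ref, digest in primary.items()
        # Copy when the DR registry lacks the image or holds a different digest.
        if secondary.get(ref) != digest
    )


if __name__ == "__main__":
    primary = {
        "ci/sonoma-base:1.2": "sha256:aaa",
        "ci/sonoma-xcode:15": "sha256:bbb",
        "ci/ventura-base:1.0": "sha256:ccc",
    }
    secondary = {
        "ci/sonoma-base:1.2": "sha256:aaa",   # already in sync
        "ci/sonoma-xcode:15": "sha256:old",   # stale copy
    }
    for ref in images_to_replicate(primary, secondary):
        print("needs copy:", ref)
```

Comparing digests rather than tags means a retagged or rebuilt image is detected as stale even when its name is unchanged.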

Harbor OCI Storage

Starting with Orka 3.5, Harbor OCI storage is the default managed storage solution for new Orka deployments. Harbor is a MacStadium-managed, OCI-compliant image registry that provides:

Secure image storage with role-based access control
Activity auditing and compliance tracking
Prometheus metrics support
Automatic resource scaling based on available Orka nodes

Existing deployments using Pure NFS storage retain their current configuration. Customers running Harbor can implement DR solutions using the capabilities of their OCI registry. See Using Harbor OCI Storage with the Orka CLI for details.

VM Runtimes

VM Runtimes are the ephemeral macOS instances themselves. These VMs typically execute a job and return some artifacts (such as a build) when complete.

Resiliency/HA Capabilities

After a failure, the calling application can quickly restart these VMs. Each VM runs on a single macOS host/node, and VMs do not provide any built-in HA capabilities.
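
Because recovery is left to the calling application, a CI system typically wraps each job in retry logic. The following is a minimal sketch of that pattern, not an Orka API: `VMFailure`, `run_with_vm_retry`, and the `run_job` callable are hypothetical names, and `run_job` is assumed to deploy a fresh VM, run the workload, and return its artifacts.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class VMFailure(Exception):
    """Hypothetical error raised when a VM or its host is lost mid-job."""


def run_with_vm_retry(
    run_job: Callable[[], T],
    max_attempts: int = 3,
    backoff_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run a job on a fresh VM and retry on a replacement VM if it fails.

    run_job is supplied by the caller (an assumption for this sketch). It
    should deploy a new VM, execute the workload, and return the artifacts.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return run_job()
        except VMFailure:
            if attempt == max_attempts:
                # Out of attempts: surface the failure to the CI system.
                raise
            # Linear backoff before requesting a replacement VM.
            sleep(backoff_seconds * attempt)
    raise AssertionError("unreachable")
```

Since the VMs are ephemeral, each retry should start from a clean image rather than try to resume the failed instance.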

DR Capabilities

VMs are intended to be ephemeral and do not implement DR capabilities.

Redundant data centers

To achieve a resilient solution across multiple data centers, MacStadium works with customers to architect a solution that routes traffic between multiple geographic sites. These solutions are designed for customers who need near-continuous uptime and wish to mitigate risks associated with localized disruptions. They are not available out of the box and require additional design work with our team.

Cluster deployment across data centers

In a redundant data center configuration, you can balance workloads between data centers using a load balancer or traffic router that sits above both clusters. You can distribute and route workloads to the most suitable site based on resource availability or performance considerations. This solution supports use cases like high availability, blue/green deployments, and rolling upgrades. It also enables testing of new deployments without risking production environments and allows for a controlled fallback if any issues are encountered.

In the event of a failure at one data center, the load balancer is reconfigured and redirects all traffic to the operational cluster. The details of this failover mechanism are designed to meet the SLAs of the customer.
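
The routing decision described above can be sketched as a small selection function. This is an illustrative sketch only, assuming a router that already knows each cluster's health and free capacity: `ClusterStatus`, `choose_cluster`, and the cluster names are hypothetical and are not part of Orka or of any specific load balancer.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ClusterStatus:
    """Hypothetical snapshot of one cluster as seen by the traffic router."""
    name: str
    healthy: bool
    free_vm_slots: int


def choose_cluster(clusters: Sequence[ClusterStatus]) -> Optional[ClusterStatus]:
    """Pick the healthy cluster with the most free capacity.

    Unhealthy clusters are skipped, so when one site fails all new work
    goes to the remaining site. Returns None if no cluster can take work.
    """
    candidates = [c for c in clusters if c.healthy and c.free_vm_slots > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.free_vm_slots)


if __name__ == "__main__":
    sites = [
        ClusterStatus("site-a", healthy=False, free_vm_slots=12),  # failed site
        ClusterStatus("site-b", healthy=True, free_vm_slots=4),
    ]
    target = choose_cluster(sites)
    print(target.name if target else "no capacity available")
```

A production design would also need health checks, timeouts, and a policy for work already running at the failed site, all of which depend on the customer's SLAs.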

Additional Responsibilities

Effective High Availability / Disaster Recovery is a shared responsibility between MacStadium and our customers.

Understanding these responsibilities is crucial for maintaining a strong HA/DR posture.

MacStadium Responsibilities

MacStadium is committed to ensuring that all hardware provided to the customer is operational and performing as expected.
In the event of a failure, MacStadium is prepared to rebuild an Orka Cluster environment specific to that customer. This ensures that customers can resume operations as quickly as possible.
MacStadium continuously monitors and maintains the underlying infrastructure to prevent potential issues before they impact customer operations.

Customer Responsibilities

If you store images in OCI-compliant repositories, implementing and managing DR plans for these images is your responsibility.
Maintain up-to-date documentation of your Orka Cluster configurations to facilitate faster recovery.
