Infrastructure
Our global infrastructure zones enable us to deploy and run your resources across geographic regions and cloud infrastructure providers, while maintaining strict isolation between tenants. We maintain additional infrastructure in each region to support our SaaS management components, including the administration console and API.
Our cloud-native architecture runs on highly optimized Kubernetes-based infrastructure and is delivered as infrastructure-as-code through a multi-stage delivery pipeline. We deploy only to the latest-generation cloud infrastructure regions from partners including Amazon Web Services (AWS) and Microsoft Azure. We distribute all infrastructure across multiple active-active datacenters (availability zones) in each region, and use the latest generation of storage and networking, to provide enterprise-grade resilience measured against a stringent SLA.
As the underlying infrastructure has several layers, there are a number of situations that may require maintenance. Some examples are security patches and improvements to host operating systems, upgrades to container orchestration layers, improvements to network security and performance, and resource migration to next-gen platform designs.
Categories of Infrastructure Maintenance
Kaleido is built to enterprise standards, and as such it provides high availability (HA) and disaster recovery (DR). Kaleido also runs its operations according to internationally recognized compliance standards such as the ISO 27000 family of specifications. The resilience of each component of the Kaleido architecture has been considered, and all customer resources running on the platform are designed to meet our documented SLA. You can read more about Kaleido’s use of HA, multi-availability zone DR and strict data isolation between tenants here.
Wherever possible, Kaleido performs platform maintenance with zero impact on customer production workloads, using active-active failover for both microservices and underlying infrastructure. However, some components, such as the Ethereum blockchain nodes themselves, do not support active-active HA and require a failover when the runtime needs to migrate to another host or exceeds its resource limits. This is due to the current design of the open source blockchain software and is independent of Kaleido’s architecture. Our testing shows that active-passive failovers of nodes result in downtime of approximately two minutes.
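Because an active-passive node failover can interrupt connectivity for roughly two minutes, client applications may want to retry failed calls for somewhat longer than that. The following is a minimal sketch of such a retry loop, not part of any Kaleido SDK: the function name `call_with_failover_retry`, the 50% safety margin, the backoff values, and the use of Python's built-in `ConnectionError` as the failure signal are all assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# The ~2 minute figure comes from the text above; the 50% margin is an assumption.
FAILOVER_SECONDS = 120.0
RETRY_BUDGET_SECONDS = FAILOVER_SECONDS * 1.5


def call_with_failover_retry(
    operation: Callable[[], T],
    budget: float = RETRY_BUDGET_SECONDS,
    initial_delay: float = 1.0,
    max_delay: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Retry `operation` with capped exponential backoff until `budget` seconds elapse."""
    deadline = clock() + budget
    delay = initial_delay
    while True:
        try:
            return operation()
        except ConnectionError:
            # Give up once another wait would take us past the deadline.
            if clock() + delay > deadline:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting. A real client would catch whatever exception its HTTP or JSON-RPC library raises on a dropped connection.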
Maintenance Types and Details
There are four types of maintenance on our infrastructure: Continuous Delivery Platform Updates, Planned, Emergency and Automated Optimization. The following table summarizes them.
TYPE | CUSTOMER RESOURCE? | DESCRIPTION | FREQUENCY |
Continuous Delivery Platform Updates | No | Platform maintenance on layers that can be updated without any downtime to customer resources. These layers are designed to be active-active highly available for zero-downtime maintenance. Example: new platform releases. | Coordinated with product releases |
Planned | Yes | Planned maintenance to customer infrastructure. Advance warning is provided according to the estimated impact severity, as defined in the severity table below. Examples: rolling upgrades to VM infrastructure for OS security patching or hardware upgrades; rolling upgrades of the orchestration layer (Kubernetes). | Maintenance windows in the 2nd and 4th week of each month, used only if needed |
Emergency | Yes | Immediate or near-immediate maintenance required. Notification is provided during and after maintenance as the situation allows. Example: high-severity security vulnerability. | Ad hoc, as needed |
Automated Optimization | Yes (Starter and Team tiers only) | Optimization of Starter and Team tier resources. | Ad hoc, as needed |
Planned Maintenance Windows
Planned maintenance will be performed during a revolving set of maintenance windows in which a failover of customer resources is more likely. These windows will occur in the 2nd and 4th week of each month and may include actions such as manually enabling optimization activities and performing regularly scheduled maintenance of the VM infrastructure. Each planned window will be 4 hours long and will fall outside typical business hours for each region in an effort to minimize customer impact.
Though windows are scheduled every 2 weeks, Kaleido will often perform no maintenance during a window because no actions are required. If Kaleido needs to perform Planned Maintenance, it will post a notification on the System Status page (https://status.kaleido.io) and the associated mailing list. The amount of advance notice is proportional to the anticipated customer impact. For example, Kaleido will provide additional notice (up to 4 weeks in advance) when a planned maintenance window will involve complex operations that are more likely to impact customer resources.
Target notice windows by maintenance types
Planned Maintenance Severity / Type | Advance Notice Target | Description |
Low | 0-5 days. Notification at the start and end of maintenance activities. | Maintenance activity on customer resources in the region may result in downtime. For this severity, affected customers can typically expect approximately 2 minutes of downtime per runtime (the typical time of a regular HA failover). |
High | 2-4 weeks. Advance notification, plus notification at the start and end of maintenance activities. | Maintenance activity on customer resources in the region may result in downtime. For this severity, affected customers can typically expect approximately 2 minutes of downtime per runtime, but downtime may extend beyond that of a regular HA failover. |
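The notice targets above amount to simple date arithmetic. The sketch below shows one way to compute when a notice would be expected for a given window start; the names `NOTICE_TARGETS`, `notice_date_range` and `meets_target` are hypothetical, and reading "2-4 weeks" as 14 to 28 days is an assumption.

```python
from datetime import date, timedelta
from typing import NamedTuple, Tuple


class NoticeTarget(NamedTuple):
    min_days: int  # shortest advance notice in the target range
    max_days: int  # longest advance notice in the target range


# Values taken from the table above; "2-4 weeks" is read as 14-28 days (assumption).
NOTICE_TARGETS = {
    "low": NoticeTarget(min_days=0, max_days=5),
    "high": NoticeTarget(min_days=14, max_days=28),
}


def notice_date_range(severity: str, window_start: date) -> Tuple[date, date]:
    """Return the (earliest, latest) dates on which notice is targeted."""
    target = NOTICE_TARGETS[severity.lower()]
    earliest = window_start - timedelta(days=target.max_days)
    latest = window_start - timedelta(days=target.min_days)
    return earliest, latest


def meets_target(severity: str, announced: date, window_start: date) -> bool:
    """True if the announcement gave at least the minimum targeted lead time."""
    lead_days = (window_start - announced).days
    return lead_days >= NOTICE_TARGETS[severity.lower()].min_days
```

For example, for a High severity window starting on 28 June 2024, `notice_date_range("high", date(2024, 6, 28))` gives a targeted notice range of 31 May to 14 June 2024.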
Emergency Maintenance
Emergency maintenance may occur at any time, provided there is sufficient justification to warrant it. Examples of appropriate justification include patching severe security vulnerabilities that expose customers to attack, resolving an outage of customer resources, and addressing significant platform degradation that would imminently disrupt customer resources. Kaleido will notify customers of emergency maintenance windows and possible service disruptions by updating the status feed as soon as is practical during the incident.
Automated Platform Optimization
Some maintenance operations are automated and may occur outside of scheduled maintenance windows. These operations are designed to keep the Kaleido platform healthy without impacting any paid-for customer resources. Nodes created on the Starter or Team tier carry no SLA guarantees, and the platform may fail them over automatically to optimize resource usage. Customers cannot control when these failovers occur and should not use the Starter or Team tier for important workloads. The platform does not run automated optimization of resources created on the Business or Enterprise tier outside of posted maintenance windows.
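As a compact summary of the rules in this document, the sketch below encodes which maintenance types may cause a failover for a resource on a given tier. It is one interpretation of the policy text, not Kaleido's implementation: the enum and function names are hypothetical, and treating optimization on Business and Enterprise resources as possible only inside a posted window is an assumption drawn from the paragraph above.

```python
from enum import Enum


class Tier(Enum):
    STARTER = "starter"
    TEAM = "team"
    BUSINESS = "business"
    ENTERPRISE = "enterprise"


class MaintenanceType(Enum):
    CONTINUOUS_DELIVERY = "continuous_delivery"
    PLANNED = "planned"
    EMERGENCY = "emergency"
    AUTOMATED_OPTIMIZATION = "automated_optimization"


# Tiers described in the text as having no SLA guarantees.
NO_SLA_TIERS = frozenset({Tier.STARTER, Tier.TEAM})


def may_affect(mtype: MaintenanceType, tier: Tier, in_posted_window: bool) -> bool:
    """Return True if this maintenance type may fail over a resource on `tier`."""
    if mtype is MaintenanceType.CONTINUOUS_DELIVERY:
        return False  # designed for zero downtime to customer resources
    if mtype is MaintenanceType.EMERGENCY:
        return True  # may occur at any time, on any tier
    if mtype is MaintenanceType.PLANNED:
        return in_posted_window  # only during posted maintenance windows
    # Automated optimization: ad hoc for Starter/Team; for other tiers,
    # assumed possible only inside a posted window.
    return tier in NO_SLA_TIERS or in_posted_window
```

For instance, `may_affect(MaintenanceType.AUTOMATED_OPTIMIZATION, Tier.STARTER, in_posted_window=False)` is `True`, while the same call with `Tier.ENTERPRISE` is `False`.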