Capacity Management and Planning for the Cloud Era
Capacity management and planning is the process of determining the supply of compute and other IT resources an organization needs to meet the dynamic demand for those resources. In this blog, I describe two phases in the context of OpenStack: Startup Capacity Management, when first building out the cloud, and Operating Capacity Management, the ongoing adjustment of resources to meet user demand on the deployed cloud.
Traditional IT capacity management has relied on simple forecasting-based approaches to plan dedicated IT infrastructure running a limited number of applications. With the advent of virtualization and cloud-based services, these simple approaches no longer suffice because both supply and demand are dynamic. On the supply side, VMs can be placed on any host and migrated between hosts, and the performance of an application can be affected by other applications sharing the same compute, storage or network infrastructure. Demand is also dynamic, with weekly, monthly or seasonal peaks as well as one-off peaks. A fundamental step in making the datacenter efficient is to understand the supply of and demand for IT resources. The supply consists of resources such as compute, storage and network; the demand arises from the applications and services that consume these resources to fulfill business needs.
In this dynamic reality, one needs to continuously match supply to demand in order to meet customer demand for resources and the associated SLAs. Two kinds of analytics solutions are needed to address this complex problem: those that provide a real-time view of the infrastructure, and those that work with detailed data accumulated over weeks or months to determine long-term trends, correlations and models. The goal of capacity management and planning is to minimize the discrepancy between resource supply and demand so as to obtain optimal ROI.
Startup Capacity Management and Planning
When initially building out the OpenStack deployment, one needs to understand the current and future workload. If you already have an application or service running in your data center, start by understanding the resources that service uses. Compute nodes, which run the VMs, are the central resource of your cloud environment. Besides compute and memory, storage and networking play a significant role in determining the performance of the OpenStack system.
Compute and Memory Planning
Initial compute and memory decisions must be driven by a good estimate of the workload that the OpenStack deployment is expected to run. OpenStack defines multiple ‘flavors’ of VMs by default. These flavors range from m1.tiny (Memory: 512MB, VCPUS: 1, Storage: 0GB) to m1.xlarge (Memory: 16384MB, VCPUS: 8, Storage: 160GB). Users generally create instances of one of these predefined flavors, but when the workload is well understood they can also create custom flavors with different memory and CPU settings. Each compute node runs one or more instances of these flavors. As the cloud builder, you need to translate your workload into hardware requirements: the number of physical cores per server, the amount of RAM, and the overcommit ratios for CPU and RAM in your cloud. With a good resource allocation, host CPU utilization will not exceed 70-80%, and the hypervisor will not be swapping memory. Conversely, the CPU should not be under-utilized either.
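As a rough illustration of the sizing decision above, the following Python sketch estimates how many instances of a flavor fit on one host for given overcommit ratios. The host size, the ratios and the reserved memory are hypothetical values chosen for the example, not recommendations; only the flavor dimensions come from the text.

```python
from dataclasses import dataclass


@dataclass
class Host:
    physical_cores: int
    ram_mb: int


@dataclass
class Flavor:
    name: str
    vcpus: int
    ram_mb: int


def instances_per_host(host, flavor, cpu_ratio, ram_ratio, reserved_ram_mb=0):
    """Instances of one flavor that fit on a host, limited by the scarcer resource."""
    vcpu_capacity = host.physical_cores * cpu_ratio
    # Memory reserved for the hypervisor and host OS is not offered to guests.
    ram_capacity = (host.ram_mb - reserved_ram_mb) * ram_ratio
    by_cpu = int(vcpu_capacity // flavor.vcpus)
    by_ram = int(ram_capacity // flavor.ram_mb)
    return min(by_cpu, by_ram)


# Hypothetical host: 16 physical cores, 128 GB RAM, 4 GB kept for the host.
host = Host(physical_cores=16, ram_mb=131072)
tiny = Flavor("m1.tiny", vcpus=1, ram_mb=512)
xlarge = Flavor("m1.xlarge", vcpus=8, ram_mb=16384)

for flavor in (tiny, xlarge):
    count = instances_per_host(host, flavor, cpu_ratio=4.0, ram_ratio=1.0,
                               reserved_ram_mb=4096)
    print(f"{flavor.name}: {count} instances per host")
```

With these assumed numbers the small flavor is limited by CPU and the large flavor by memory, which is why the expected flavor mix matters when choosing hardware.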
Storage performance is one of the most important factors in building a cloud. Even if the I/O requirements of a single VM are modest, the aggregate I/O requirements of dozens or hundreds of VMs can cause serious resource bottlenecks and performance management issues. It is important to plan the storage I/O requirements of the virtualized environment, taking into account the number of VMs accessing storage, the average load per VM, and random and sequential read/write performance.
Booting a new instance from Glance can require significant sequential I/O performance, since fast boot-up means transferring images of 20-30GB quickly. While local image caches can partially alleviate this, instance launch speed remains an important consideration for dynamic system topologies in which applications must respond quickly to demand spikes. High random I/O performance and low latency (low single-digit milliseconds) in Cinder are important for supporting applications such as high-performance databases and transaction processing systems.
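The aggregate effect can be estimated with simple arithmetic. The Python sketch below multiplies VM count by average per-VM IOPS and a peak factor, then splits the total into reads and writes. All input figures (200 VMs, 50 IOPS each, a 2x peak factor, a 70/30 read/write mix) are hypothetical placeholders to be replaced with measurements from your own workload.

```python
import math


def aggregate_iops(vm_count, avg_iops_per_vm, peak_factor=1.0):
    """Total IOPS the storage backend must sustain, rounded up."""
    return math.ceil(vm_count * avg_iops_per_vm * peak_factor)


def split_read_write(total_iops, read_fraction):
    """Split a total IOPS figure into (reads, writes)."""
    reads = round(total_iops * read_fraction)
    return reads, total_iops - reads


# Hypothetical workload: modest per-VM load, but many VMs.
total = aggregate_iops(vm_count=200, avg_iops_per_vm=50, peak_factor=2.0)
reads, writes = split_read_write(total, read_fraction=0.7)
print(f"total={total} IOPS, reads={reads}, writes={writes}")
```

A fuller plan would repeat this separately for random and sequential access patterns, since the same backend usually sustains very different rates for each.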
OpenStack supports a number of different hypervisors, including KVM, Hyper-V, ESXi, and Xen. The hypervisor generally does not have a bearing on capacity management, but selecting one is a key decision when building an OpenStack cloud. Hypervisor selection can be driven by many factors, including licensing costs, the familiarity of IT management with the product, support maturity, and ease of management across the IT infrastructure. While most installations use a single hypervisor, it is also possible to use multiple hypervisors within one installation.
KVM, ESX and Hyper-V support most of the features required for OpenStack compute. A notable exception is the absence of Vlan, VlanManager and Routing support in Hyper-V without the Quantum Hyper-V Plugin. Also, host firewall rules can be specified in KVM, but not in ESX or Hyper-V. Another important factor in selecting a hypervisor is the level of support and maturity of its OpenStack compute driver. Libvirt support for KVM is more mature than the drivers for other hypervisors.
Virtualization gives us the ability to create machines of the “right size” that are abstracted away from physical constraints. For example, a virtual server with ½ CPU core and 1 GB of memory can be created, and many such virtual servers can run on a single host to utilize the available resources more fully. Overcommitting, that is, provisioning more virtual capacity than the physical hardware supports, can further increase the efficiency of resource utilization. It is important to be able to compute the overcommit ratio so that system performance is not adversely affected and memory thrashing and other resource contention overheads do not occur.
Overcommitment needs to take into account not just the daily or average resource consumption, but also the peak resources used. Workloads with complementary peaks allow the overcommitment ratio to be maximized. KVM and other hypervisors support kernel same-page merging (deduplication), memory ballooning and thin storage provisioning, among other techniques, to manage this. However, static overprovisioning can only take one so far: it cannot adjust to dynamic workloads and requires significant tuning to work well. External control and feedback systems that monitor host and guest workloads can rapidly adjust to dynamic workloads and improve overcommitment. This external feedback must also take past statistics into account for effective real-time system tuning.
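To see why complementary peaks help, compare the peak of the combined load with the sum of the individual peaks. The Python sketch below does this for two made-up demand series sampled at the same times; the numbers are purely illustrative.

```python
def peak_of_sum(series):
    """Peak of the combined load when all workloads share one host."""
    return max(sum(samples) for samples in zip(*series))


def sum_of_peaks(series):
    """Capacity needed if every workload were sized for its own peak."""
    return sum(max(samples) for samples in series)


# Hypothetical CPU demand (percent of one host) sampled at the same six times.
day_workload = [10, 20, 80, 90, 30, 10]
night_workload = [85, 70, 15, 10, 60, 90]
workloads = [day_workload, night_workload]

combined_peak = peak_of_sum(workloads)
individual_peaks = sum_of_peaks(workloads)
print(f"combined peak: {combined_peak}, sum of peaks: {individual_peaks}")
print(f"achievable overcommit from peak offset: {individual_peaks / combined_peak:.2f}x")
```

Because the two example workloads peak at different times, the host only has to cover the combined peak rather than both individual peaks at once. Workloads whose peaks coincide would show a ratio close to 1.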
Physical network topology and configuration are critical to get right in the initial design. Unlike compute capacity shortages, which can be alleviated by adding compute nodes, physical network topology and configuration are hard to change. It is often not possible to make datacenter network changes without destroying instances that are accessing block storage over the network or providing continuously accessed services.
While small OpenStack deployments may run on a 1GbE network, a 10GbE network is generally recommended for the data network to handle inter-VM and VM-to-storage traffic. For OpenStack compute, the data network is set up between multiple hosts on a single physical network, with two NICs. Obtaining the maximum speed often requires tuning network parameters such as the maximum buffer size or TCP window size.
It is important to calculate the number of instances that could be spun up in the deployment in order to determine the address space needed, and to decide how many instances need a publicly accessible IP address for external access to APIs and VMs. Based on the OpenStack Compute network planning guidelines, some of the recommended sizes are: Management Network: 255 IPs; Public Network: 8 IPs; VM Data Network: 1024 IPs; publicly routable IP addresses: minimum 16 IPs.
Static network topologies require manual intervention to deploy and migrate instances, which adds cost and slows the organization’s ability to respond to environmental changes. With the ability to create virtual networks dynamically, tuned for VM load and mobility, one can achieve better utilization of the datacenter. Quantum provides tenants with an API to build rich network topologies and configure network policies, while providing strict tenant isolation and granular usage accounting.
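One common starting point for sizing TCP buffers is the bandwidth-delay product: the amount of data that must be in flight to keep a link full. The Python sketch below computes it for 1GbE and 10GbE links. The 1 ms round-trip time is an assumed value for illustration; measure the actual RTT in your own network.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_s, rtt_microseconds):
    """Bytes in flight needed to saturate a link with the given round-trip time."""
    bits_in_flight = bandwidth_bits_per_s * rtt_microseconds // 1_000_000
    return bits_in_flight // 8


ASSUMED_RTT_US = 1000  # hypothetical 1 ms round trip between hosts

for label, bandwidth in (("1GbE", 1_000_000_000), ("10GbE", 10_000_000_000)):
    bdp = bandwidth_delay_product_bytes(bandwidth, ASSUMED_RTT_US)
    print(f"{label}: window of at least {bdp} bytes to fill the link")
```

If the maximum window or socket buffer is smaller than this value, a single TCP connection cannot fill the link regardless of the NIC speed.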
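The address-space calculation can be sketched with Python's standard ipaddress module. The example treats each figure above as the number of usable addresses required (an interpretation made for this sketch) and finds the smallest IPv4 prefix that provides them after setting aside the network and broadcast addresses. The 10.0.0.0 base address is a placeholder.

```python
import ipaddress


def prefix_for_hosts(required_ips):
    """Smallest IPv4 prefix length whose usable addresses cover required_ips."""
    total = required_ips + 2  # network and broadcast addresses are not usable
    host_bits = (total - 1).bit_length()  # smallest n with 2**n >= total
    return 32 - host_bits


def usable_addresses(prefix):
    """Usable host addresses in a subnet of the given prefix length."""
    network = ipaddress.ip_network(f"10.0.0.0/{prefix}")  # placeholder base
    return network.num_addresses - 2


requirements = [
    ("Management Network", 255),
    ("Public Network", 8),
    ("VM Data Network", 1024),
    ("Publicly routable", 16),
]

for name, needed in requirements:
    prefix = prefix_for_hosts(needed)
    print(f"{name}: need {needed}, use /{prefix} ({usable_addresses(prefix)} usable)")
```

Note that a requirement sitting exactly on a power of two, such as 1024, spills into the next larger subnet once the two reserved addresses are counted.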
Operating Capacity Management and Planning
Once the virtualized infrastructure is up and running, it is important to monitor it and ensure its smooth operation. Even the best algorithms for dynamic CPU and memory overprovisioning will not allow the infrastructure to be used to its full potential if the storage system or network becomes a bottleneck. As part of operating capacity management and planning, one needs to constantly monitor CPU, memory, network and storage I/O, and ensure that no VM is starved of resources. This requires real-time analytics on time-series data to determine the health and stability of the running system and to make near-real-time load balancing decisions. Additionally, one needs to look at historical data (weekly, monthly and seasonal trends, as well as one-off peaks) to determine the optimal capacity, and whether capacity has to be added incrementally or load has to be balanced better. This requires an analytics solution that works with detailed data accumulated over weeks or months to determine long-term trends and correlations in the deployment and to create appropriate models.
The first step is to establish a baseline and determine trends. The second step is to create a model with constraints and goals for resource provisioning. What makes this problem hard is coming up with a model that accounts for abnormal workloads and dynamic consumption peaks and valleys. The problem is exacerbated by the large number of dependent and independent variables that need to be taken into account, from network fabric performance to I/O and storage performance to end-user application performance. Virtualization also takes away the predictable view of compute, networking and storage that applications had on dedicated hardware. The noisy neighbor problem is a classic example: it arises when VMs share the same physical resources and one VM consumes a disproportionate share of them. Different solutions to the noisy neighbor problem exist, depending on whether applications treat VMs as ‘pets’ that need to be carefully nurtured (for example, migrated to a lightly loaded host) or as ‘cattle’ that can be stopped and restarted. Each of these solutions has a strong impact on application architecture and on how operating capacity is managed.
The number of data points to be collected across the infrastructure is high. For a single OpenStack cluster deployment with, say, 100 hosts running 1000 VMs, collecting metrics across the hosts, VMs, storage, physical networks and virtual networks every minute can conservatively produce 100 billion data points a month. Analyzing this volume of data requires big data analysis to mine it for information and to understand interactions, such as memory ballooning versus storage I/O, for a customer’s workload.
The models constructed can be used for what-if analysis, allowing the cloud operator to evaluate different scenarios. Ultimately, an operating capacity management solution enables the cloud operator to extract intelligence from the data collected, so that the virtualized resource supply always meets the dynamic demand.
In summary, in the cloud era, capacity planning is a complex undertaking that has to balance supply and demand in a highly dynamic environment. The simple forecasting-based approaches of the past no longer suffice. Applications running in VMs are no longer isolated islands and can affect each other. While virtualization has enabled organizations to consolidate their servers, virtualization and cloud architectures expose a new set of variables that need to be taken into account in capacity management and planning in order to maximize the efficiency of the underlying IT assets.
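The baseline-and-trend step can be illustrated with a very simple model. The Python sketch below fits a least-squares line to a utilization history and estimates how many periods remain before the trend reaches a capacity limit. A straight line is a deliberately simplistic stand-in for the richer models discussed above (it ignores seasonality and one-off peaks), and the sample data and the 80% limit are hypothetical.

```python
def linear_trend(values):
    """Least-squares slope and intercept for values sampled at x = 0, 1, 2, ..."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def periods_until(values, capacity):
    """Periods after the last sample until the trend line reaches capacity.

    Returns None when the trend is flat or decreasing.
    """
    slope, intercept = linear_trend(values)
    if slope <= 0:
        return None
    crossing = (capacity - intercept) / slope
    return max(0.0, crossing - (len(values) - 1))


# Hypothetical weekly average CPU utilization (percent) for one host aggregate.
weekly_cpu = [50, 52, 54, 56, 58]
slope, intercept = linear_trend(weekly_cpu)
print(f"baseline {intercept:.1f}%, growth {slope:.1f} points per week")
print(f"weeks until 80% utilization: {periods_until(weekly_cpu, 80)}")
```

The same two functions can support a basic what-if analysis by re-running them with a different capacity limit or with a history adjusted for a planned workload.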
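The monthly volume depends heavily on how many distinct metrics are collected per monitored entity. The Python sketch below makes that arithmetic explicit for the 100-host, 1000-VM example at one sample per minute. The metric counts are assumptions for illustration, and storage and network objects are left out of the entity count for simplicity.

```python
MINUTES_PER_MONTH = 60 * 24 * 30  # 43,200 samples per series at one per minute


def monthly_data_points(entities, metrics_per_entity,
                        samples_per_month=MINUTES_PER_MONTH):
    """Data points collected in a month across all entities and metrics."""
    return entities * metrics_per_entity * samples_per_month


def metrics_needed(total_points, entities, samples_per_month=MINUTES_PER_MONTH):
    """Average metrics per entity implied by a monthly data point total."""
    return total_points / (entities * samples_per_month)


entities = 100 + 1000  # hosts plus VMs only

for metrics in (10, 100, 1000):  # assumed metrics per entity
    points = monthly_data_points(entities, metrics)
    print(f"{metrics} metrics per entity -> {points:,} data points per month")

implied = metrics_needed(100_000_000_000, entities)
print(f"100 billion points per month implies about {implied:.0f} series per entity")
```

Under these assumptions, hosts and VMs alone reach billions of points per month with around a hundred metrics each. Totals on the order of 100 billion correspond to a couple of thousand series per entity, or to counting many more monitored objects such as volumes, ports and virtual networks.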
Other blogs and presentations on capacity planning for OpenStack:
- Ryan Richard’s talk from OpenStack Summit 2013.
- OpenStack capacity planning by Piotr Siwczak from Mirantis.