Web Scale IT
Virtualization has been a driver for IT in the past decade. It has enabled cloud computing and has transformed the enterprise datacenter by allowing efficient utilization of resources and increased agility. However, new cloud and application architectures are transforming the software stack and the datacenter. New workloads and big data applications increase scaling demands and impose stringent operational constraints. Applications need to scale up to millions of users and must meet their service, performance and availability objectives. These applications consist of dozens or more individual programs that interact to provide services such as email, CRM or maps. Virtualization is no longer sufficient.
Applications require a massive distributed computing infrastructure that is highly available and scales out. The infrastructure must be cost efficient – it should enable organizations to generate higher revenues at lower operational costs. Above all, the infrastructure must allow applications to efficiently meet their performance and availability objectives.
The attractive economics of commodity servers puts clusters of hundreds or thousands of nodes within the reach of many corporations. Combine this with a large number of cores per socket, and a single rack of servers can have thousands of threads. This is an unprecedented amount of raw compute power that needs to be orchestrated and delivered to end user applications.
Over the last decade, compute virtualization has become the standard way for enterprises to efficiently utilize their IT resources. A physical server runs several virtual machines, intermediated by the hypervisor. Each virtual machine runs a copy of an entire guest operating system, and each image must contain the base operating system along with the libraries and packages needed for the application. The hypervisor provides an abstraction of the hardware to the guest operating system (in type 2 virtualization, the hypervisor itself runs on a host operating system).
Containers, in contrast, are isolated namespace instances made possible by operating system virtualization. They run directly on the underlying operating system. Multiple containers can run in independent namespaces for process IDs, networking and so on, while sharing the operating system call interface. They are far more efficient than virtual machines in terms of system resources, carrying none of the baggage of a full guest operating system. Applications in containers can run at bare-metal efficiency, and containers can be started in a fraction of a second, as opposed to minutes for VMs.
This changes how applications can be deployed. With virtual machines, when a package is updated, the entire image needs to be rebuilt and the application has to be restarted. In a containerized approach, application components are partitioned into separate containers, so updating a package or a component requires only the affected container to be rebuilt and relaunched. This makes it possible for containerized systems to respond rapidly to failures or changes in workload. Containers add a new level of agility to the software development lifecycle.
While Linux containers have been around for several years, Docker recently brought them into the mainstream development environment by combining containerization with workflows and tooling to package and distribute applications in containers. Docker-based applications run exactly the same on a laptop as they do in production. Docker encapsulates the entire state around an application, so one does not have to worry about missing dependencies or bugs due to differences in the underlying operating system version.
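As an illustration, this packaging takes the form of a Dockerfile. The sketch below assumes a small Python web service; the file names and port are hypothetical:

```dockerfile
# Pin the base OS and runtime version, so the application behaves
# identically on a laptop and in production.
FROM python:2.7

# Install dependencies in their own layer; this layer is rebuilt
# only when requirements.txt changes.
COPY requirements.txt /app/requirements.txt
RUN pip install -r /app/requirements.txt

# Add the application code and declare how to run it.
COPY . /app
WORKDIR /app
EXPOSE 8080
CMD ["python", "app.py"]
```

The image built from this file (`docker build -t myapp .`) carries the application together with every dependency, which is what makes the laptop-to-production parity described above possible.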
Containers provide isolation based on namespaces and cgroups. Without the right protections set, an application in one container can potentially access data in another. While enhanced isolation is an area of active research, security best practices recommend that containers in separate security domains run on separate hosts or virtual machines. This approach is followed by services such as the Amazon EC2 Container Service and Google Container Engine.
Building Applications from Containers: Kubernetes
Building a web scale application requires a system to orchestrate the dozens, hundreds or thousands of containers that deliver the application. This means managing containerized applications across multiple hosts and providing basic mechanisms for deployment, maintenance, and scaling. The system should provide high utilization through admission control, efficient task scheduling and over-commitment.
It needs to support high-availability applications with minimal fault-recovery time, and allow for scheduling policies that reduce the probability of correlated failures. Kubernetes solves many of these common problems of building distributed containerized systems.
Kubernetes is primarily targeted at applications composed of multiple containers, such as elastic, distributed micro-services. It provides self-healing mechanisms, such as auto-restarting, re-scheduling, and replicating containers, and supports robust declarative primitives for maintaining the desired state requested by the user, such as a replication factor. When users ask Kubernetes to run a set of containers on a cluster, the system automatically chooses hosts to run those containers on. The Kubernetes scheduler takes into account individual and collective resource requirements, quality-of-service requirements, hardware/software/policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference and deadlines.
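These declarative primitives are expressed as manifests. As a sketch, a replication controller manifest might declare that three copies of a front-end container should always be running; the names and image below are illustrative:

```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: web-frontend
spec:
  replicas: 3            # desired state: keep three copies running
  selector:
    app: web-frontend
  template:              # pod template used to create replacements
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
      - name: frontend
        image: example/frontend:1.0   # hypothetical image
        ports:
        - containerPort: 8080
```

If a host fails and the replica count drops below three, Kubernetes notices the divergence from the declared state and schedules a replacement container automatically.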
A key aspect of building these distributed systems is a resilient store for configuration data and a coordination mechanism for communication between the different system components. The store holds values such as the list of machines in the cluster, the mapping between allocated subnets and real host IP addresses, or which pods are running which services. These values can be monitored, allowing the application to reconfigure itself when they change. The store and the coordination mechanism must be resilient to the failure of one or more machines, and to network partitions. In the CoreOS stack, etcd performs this role: it employs the Raft distributed consensus protocol to maintain identical logs of state updates on multiple nodes, allowing it to tolerate host and network failures.
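The store-and-watch pattern can be sketched with the etcd command-line client; the key names and values below are illustrative:

```shell
# Publish a piece of cluster configuration.
etcdctl set /config/subnets/host-10.0.1.5 '10.244.1.0/24'

# Read it back from any member of the etcd cluster.
etcdctl get /config/subnets/host-10.0.1.5

# Block until the value changes, then react -- this is how
# components reconfigure themselves when cluster state changes.
etcdctl watch /config/subnets/host-10.0.1.5
```

Because every write goes through Raft consensus, a value read back from any member reflects the same replicated log, even if some members have failed.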
Limitations and Challenges
While containers and Kubernetes address many of the enterprise concerns of performance, efficiency and agility, there are a few limitations in the stack that the application developer needs to account for. We describe some of these below.
The Kubernetes services on the master are themselves stateless, so a highly available master also requires a highly available store: Kubernetes must be set up with etcd configured as a multi-member cluster. This is less of an issue when running Kubernetes on GCE or another public cloud, where even a single master can be considered highly available, since the underlying volume is a network-replicated persistent disk.
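A multi-member etcd cluster can be bootstrapped statically by starting each member with the full cluster list. The sketch below shows the flags for one member of a three-member cluster; the member names and addresses are illustrative, and the same command is repeated on each host with its own name and URLs:

```shell
# First member of a three-node etcd cluster; any single member
# can fail without losing the store, since Raft needs a majority.
etcd --name infra0 \
  --initial-advertise-peer-urls http://10.0.1.10:2380 \
  --listen-peer-urls http://10.0.1.10:2380 \
  --listen-client-urls http://10.0.1.10:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://10.0.1.10:2379 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380 \
  --initial-cluster-state new
```

The Kubernetes API server is then pointed at all three client URLs, so the control plane survives the loss of any one etcd member.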
By default, containers are assigned an IP address that can be used to communicate only with other containers on the same host. Port-mapping approaches for inter-container communication are complex and difficult to manage. The alternative is a packet-encapsulation approach that creates a virtual overlay network for container-to-container communication across hosts. These networking approaches currently suffer from poor performance.
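As one example of the encapsulation approach, flannel builds such an overlay by reading its network configuration from etcd; the address range below is illustrative:

```shell
# Tell flannel which address space to carve into per-host subnets,
# and which encapsulation backend to use for cross-host traffic.
etcdctl set /coreos.com/network/config \
  '{ "Network": "10.1.0.0/16", "Backend": { "Type": "vxlan" } }'
```

Each host's flannel daemon then leases a subnet from this range for its local containers and encapsulates cross-host container traffic; this encapsulation is one source of the performance overhead noted above.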
Another important requirement for building applications is the availability of containerized stateful services such as databases and object stores. Containers in the cloud have access to highly available data stores, such as the network-replicated persistent disks in GCE. A similar distributed scale-out solution is needed for the container stack in the enterprise. A few solutions are in the works, but this area is ripe for innovation.
Containers currently suffer from limitations that make it difficult to set up external DNS. Random IP assignment makes it difficult to bind services to external names, and there are limitations on specifying IP addresses manually. Non-trivial networking expertise is required to roll out a virtual network setup.