What is a service mesh? How does it work? Why would you want a service mesh in your application and what can it provide? Get a quick overview of service mesh and Kubernetes.
One of the biggest challenges in developing cloud native applications today is increasing the speed and frequency of your deployments. Shorter, more frequent deployments shorten the feedback loop and reduce the risk carried by each individual release.
But with more frequent releases, the chances of negatively affecting application reliability or customer experience also increase. It’s important that operations and DevOps teams develop processes to automate deployment strategies that minimize risk to the product and customers.
A few weeks ago, Stefan Prodan delivered a talk for the Weave Online User Group on service meshes for Kubernetes and how they are used for advanced deployment strategies like Progressive Delivery.
In software architecture, a service mesh is a dedicated infrastructure layer that facilitates service-to-service communication between microservices, often using a sidecar proxy. Although this definition sounds very much like a CNI implementation on Kubernetes, there are some differences. A service mesh typically sits on top of the CNI and builds on its capabilities, adding features such as service discovery and security.
The components of a service mesh include the data plane, made up of the sidecar proxies that handle traffic between services, and the control plane, which configures those proxies and manages certificates and policies.
A service mesh allows you to separate the business logic of the application from observability, network, and security policies. It allows you to connect, secure, and monitor your microservices.
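To make the sidecar model concrete, here is a minimal sketch of how proxy injection is typically switched on in Istio: labeling a namespace so the mesh’s admission webhook injects the proxy container into every pod scheduled there. The namespace name is a placeholder, and other meshes use their own markers (Linkerd, for instance, uses a `linkerd.io/inject: enabled` annotation).

```yaml
# Hypothetical namespace; the istio-injection label tells Istio's mutating
# admission webhook to add the Envoy sidecar to every pod created here.
apiVersion: v1
kind: Namespace
metadata:
  name: demo
  labels:
    istio-injection: enabled
```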
The example shown below illustrates a Kubernetes cluster with an app composed of these services: a front-end, a backend and a database.
The blue arrows represent the traffic that comes into your cluster through an ingress gateway. This ingress gateway can be anything from NGINX to a cloud-based load balancer such as AWS ELB.
In this case, we’re also using an egress gateway, which is worth running in your cluster for better security since it gives you a single controlled exit point for outbound traffic. The red arrows indicate east-west traffic, i.e. the traffic that flows between your services.
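If the mesh in the diagram were Istio, the ingress gateway could be described with a Gateway resource like the minimal sketch below; it assumes the default `istio-ingressgateway` deployment, and the hostname is a placeholder.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: public-gateway
  namespace: istio-system
spec:
  # Bind to the stock istio-ingressgateway pods
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "app.example.com"   # placeholder hostname
```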
If you are not using a service mesh and you’re implementing a plain vanilla Kubernetes cluster instead, these are the problems you’ll run into:
By default, there is no security between services, and nothing is encrypted at the network layer inside the cluster. Even if you are using Weave Net with encryption enabled, only traffic between nodes is encrypted; if two services are running on the same node, the traffic between them will not be encrypted.
To mitigate this risk, you could of course use TLS certificates for communication between all of your services. But doing so means more work for your SRE team, since they will need to rotate and manage those TLS certificates. Your dev team will also need to integrate TLS into each service, among other tasks. In the end, it’s no small feat to implement TLS across your cluster.
Most service meshes aim for end-to-end encryption, which saves time for both teams. A service mesh injects a sidecar with a TLS certificate into each pod, and the control plane includes a certificate authority that rotates the certificates for you.
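In Istio, for example, turning on strict mutual TLS for the whole mesh is a single resource; the sketch below assumes Istio is installed in the `istio-system` namespace, where its built-in CA issues and rotates the workload certificates.

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # applying it in the root namespace makes it mesh-wide
spec:
  mtls:
    mode: STRICT            # sidecars accept only mutually authenticated TLS traffic
```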
Another problem with a plain vanilla cluster is that it can be difficult to pinpoint the source of an issue. A latency problem, for example, may be particularly difficult to trace by looking at the data from a single service: the metrics you are reading may have nothing to do with the service that communicates with the outside world. The real cause may be a malformed database query, or it could be a problem in the front-end; you don’t know for sure.
Without a service mesh, you can solve this kind of problem by instrumenting your code and then measuring your requests between each service in your application. But if you use a service mesh like Istio that has distributed tracing built right in, you won’t have to worry about the extra step of code instrumentation.
With a service mesh, all ingress and egress traffic is routed through the proxy sidecars, and each sidecar adds tracing headers to the requests it forwards. When a request comes through the ingress gateway to the front-end and then on to the backend, you get a trace covering all of those hops without having to instrument your code.
Because metrics are built into a service mesh, you can take advantage of more advanced load balancing strategies. For example, the front-end can be scaled up when it has more traffic and you can also pinpoint other traffic bottlenecks more easily.
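One way to act on those metrics is to scale the front-end on request rate with the Horizontal Pod Autoscaler. The sketch below assumes the mesh’s Prometheus metrics are exposed to the HPA through a custom-metrics adapter; the metric name `istio_requests_per_second` and the deployment name `frontend` are hypothetical.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: istio_requests_per_second   # assumed to be served by a custom-metrics adapter
      target:
        type: AverageValue
        averageValue: "10"                # target ~10 requests per second per pod
```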
Not all of the service meshes out there have every one of these capabilities, but in general these are the kinds of features you gain. The comparison below shows how the leading options stack up.
There are three leading contenders for service mesh in the Kubernetes ecosystem: Istio, Linkerd v2, and Consul. All three are open source. Each has its own benefits and drawbacks, but using any of them will put your DevOps teams in a better position to thrive as they develop and maintain more and more microservices.
| | Istio | Linkerd v2 | Consul |
|---|---|---|---|
| **Supported Workloads** | *Does it support both VM-based applications and Kubernetes?* | | |
| Workloads | Kubernetes + VMs | Kubernetes only | Kubernetes + VMs |
| **Architecture** | *The solution’s architecture has implications on operational overhead.* | | |
| Single point of failure | No – uses sidecar per pod | No | No, but HA adds complexity since you must install the Consul server and manage its quorum operations, vs. using the native K8s control plane primitives |
| Sidecar Proxy | Yes (Envoy) | Yes | Yes (Envoy) |
| Per-node agent | No | No | Yes |
| **Secure Communication** | *All three support mutual TLS encryption (mTLS) and native certificate management, so you can rotate certificates or revoke them if they are compromised.* | | |
| mTLS | Yes | Yes | Yes |
| Certificate Management | Yes | Yes | Yes |
| Authentication and Authorization | Yes | Yes | Yes |
| **Communication Protocols** | | | |
| TCP | Yes | Yes | Yes |
| HTTP/1.x | Yes | Yes | Yes |
| HTTP/2 | Yes | Yes | Yes |
| gRPC | Yes | Yes | Yes |
| **Traffic Management** | | | |
| Blue/Green Deployments | Yes | Yes | Yes |
| Circuit Breaking | Yes | No | Yes |
| Fault Injection | Yes | Yes | Yes |
| Rate Limiting | Yes | No | Yes |
| **Chaos Monkey-style Testing** | *Traffic management features allow you to introduce delays or failures into some of the requests in order to improve the resiliency of your system and harden your operations (a minimal Istio example follows the table).* | | |
| Testing | Yes – you can configure services to delay or outright fail a certain percentage of requests | Limited | No |
| **Observability** | *In order to identify and troubleshoot incidents, you need distributed monitoring and tracing.* | | |
| Monitoring | Yes, with Prometheus | Yes, with Prometheus | Yes, with Prometheus |
| Distributed Tracing | Yes | Some | Yes |
| **Multicluster Support** | | | |
| Multicluster | Yes | No | Yes |
| **Installation** | | | |
| Deployment | Install via Helm and Operator | Helm | Helm |
| **Operations Complexity** | *How difficult is it to install, configure, and operate?* | | |
| Complexity | High | Low | Medium |
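As an example of the fault-injection row above, here is a minimal Istio sketch that delays 10% of requests to a hypothetical `backend` service by five seconds and fails another 5% with HTTP 503; Linkerd and Consul configure this differently where they support it.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: backend
spec:
  hosts:
  - backend
  http:
  - fault:
      delay:
        percentage:
          value: 10        # delay 10% of requests
        fixedDelay: 5s
      abort:
        percentage:
          value: 5         # abort 5% of requests
        httpStatus: 503
    route:
    - destination:
        host: backend
```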
Any of these service meshes will solve your basic needs. The choice comes down to whether you want more than the basics.
Istio has the most features and flexibility of any of these three service meshes by far, but remember that flexibility means complexity, so your team needs to be ready for that.
For a minimalistic approach supporting just Kubernetes, Linkerd may be the best choice. If you want to support a heterogeneous environment that includes both Kubernetes and VMs and do not need the complexity of Istio, then Consul would probably be your best bet.
Istio has a Go control plane and uses Envoy as its data plane proxy. Istio is a complex system that does many things: tracing, logging, TLS, authentication, and so on. A drawback is its resource-hungry control plane, says Stefan: the more services you have, the more resources you need to run them on Istio.
AWS App Mesh has a managed control plane that also uses an Envoy proxy for its data plane, so you don’t have to run the control plane yourself on your cluster. It works very similarly to Istio, but since it’s fairly new it still lacks many of the features that Istio has. For example, it doesn’t include mTLS or traffic policies.
Linkerd v2 also has a Go control plane and a Linkerd proxy data plane written in Rust. Linkerd has some distributed tracing capabilities and recently implemented traffic shifting. The current 2.4 release implements the Service Mesh Interface (SMI) traffic split API, which makes it possible to automate Canary deployments and other progressive delivery strategies with Linkerd and Flagger. The Linkerd roadmap also shows that many other new features will be implemented over the next year.
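A TrafficSplit resource is what Flagger manipulates on Linkerd to move traffic between the stable and canary versions of a workload. The sketch below sends 90% of requests to a hypothetical `backend-stable` service and 10% to `backend-canary`; the exact `apiVersion` depends on which SMI release your mesh supports.

```yaml
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: backend
spec:
  service: backend            # the apex service that clients call
  backends:
  - service: backend-stable
    weight: 900               # weights are relative: 900/100 = 90%/10%
  - service: backend-canary
    weight: 100
```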
Consul Connect uses a Consul control plane and requires the data plane to be managed inside the application. It does not implement Layer 7 traffic management, nor does it support Kubernetes CRDs.
Progressive delivery is Continuous Delivery with fine-grained control over the blast radius. This means that you can deliver new features of your app to a certain percentage of your user base.
In order to control progressive deployments, you need fine-grained control over traffic routing as well as metrics to evaluate each step, both of which a service mesh provides.
Canary
A canary deployment is used when you want to test new functionality, typically on the backend of your application. Traditionally you would run two almost identical versions: one serving all users, and another with the new features rolled out to only a subset of users so the two can be compared. When no errors are reported, the new version is gradually rolled out to the rest of the infrastructure.
While this strategy can be implemented with plain Kubernetes resources by swapping old and new pods, it is far more convenient and easier to use a service mesh like Istio together with an add-on like Flagger, which automates the shift in traffic.
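For illustration, here is a trimmed-down Flagger Canary resource of the kind Stefan demonstrates; it assumes Istio as the mesh provider, a Deployment named `backend` in a `test` namespace, and Flagger’s built-in Prometheus checks. Flagger then creates the primary and canary services and shifts traffic in `stepWeight` increments, rolling back if the metrics breach their thresholds.

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: backend
  namespace: test
spec:
  provider: istio
  targetRef:                      # the Deployment Flagger controls
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  service:
    port: 9898                    # port exposed by the generated services
  analysis:
    interval: 1m                  # how often metrics are evaluated
    threshold: 5                  # failed checks before rollback
    maxWeight: 50                 # cap on canary traffic
    stepWeight: 10                # traffic increment per interval
    metrics:
    - name: request-success-rate  # built-in check, percentage
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration      # built-in check, milliseconds
      thresholdRange:
        max: 500
      interval: 1m
```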