
Introduction to Service Meshes on Kubernetes

What is a service mesh? How does it work? Why would you want a service mesh in your application, and what can it provide? Get a quick overview of service meshes on Kubernetes.

Yiadh TLIJANI

One of the biggest challenges in developing cloud native applications today is increasing the frequency of your deployments. Shorter, more frequent deployments offer the following benefits:

  • Reduced time-to-market.
  • Customers get new functionality faster.
  • Customer feedback flows back into the product team faster, which means the team can iterate on features and fix problems more quickly.
  • Shipping more features to production makes for a happier development team.

But with more frequent releases, the chances of negatively affecting application reliability or customer experience also increase. It’s important that operations and DevOps teams develop processes to automate deployment strategies that minimize risk to the product and customers.

A few weeks ago, Stefan Prodan delivered a talk to the Weave Online User Group on service meshes for Kubernetes and how they are used for advanced deployment strategies like progressive delivery.

What is a Service Mesh?

In software architecture, a service mesh is a dedicated infrastructure layer for facilitating service-to-service communication between microservices, often using a sidecar proxy. Although this definition sounds very much like a CNI implementation on Kubernetes, there are some differences. A service mesh typically sits on top of the CNI and builds on its capabilities, adding features such as service discovery and security.

The components of a service mesh include:

  • Data plane - made up of lightweight proxies that are distributed as sidecars. Proxies such as NGINX or Envoy can be used to build your own service mesh in Kubernetes. In Kubernetes, the proxies run as sidecar containers in every Pod, next to your application (see the sketch after this list).
  • Control plane - provides the configuration for the proxies, acts as the TLS certificate authority, and contains the policy managers. It can collect telemetry and other metrics, and some service mesh implementations also include the ability to perform tracing.
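
For example, with Istio the data-plane proxy can be injected automatically: labeling a namespace tells the control plane's admission webhook to add an Envoy sidecar container to every Pod created there. A minimal sketch, assuming Istio is installed (the namespace name is illustrative):

    apiVersion: v1
    kind: Namespace
    metadata:
      name: demo                  # hypothetical namespace
      labels:
        istio-injection: enabled  # ask Istio to inject the Envoy sidecar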

How is a service mesh useful?

A service mesh allows you to separate the business logic of the application from observability and from network and security policies. It allows you to connect, secure, and monitor your microservices.

  1. Connect: A service mesh enables services to discover and talk to each other. It enables intelligent routing to control the flow of traffic and API calls between services/endpoints, which also enables advanced deployment strategies such as blue/green deployments, canary releases, and rolling upgrades.
  2. Secure: A service mesh allows you to secure communication between services. It can enforce policies to allow or deny communication, e.g. a policy that denies access to production services from a client service running in a development environment (see the sketch after this list).
  3. Monitor: A service mesh enables observability of your distributed microservices system. It often integrates out of the box with monitoring and tracing tools (such as Prometheus and Jaeger in the case of Kubernetes) to let you discover and visualize dependencies between services, traffic flow, API latencies, and traces.
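
As a sketch of the "Secure" point above, this is how Istio would express such a rule as an AuthorizationPolicy; the namespace names are illustrative, assuming production workloads run in prod and development clients in dev:

    apiVersion: security.istio.io/v1beta1
    kind: AuthorizationPolicy
    metadata:
      name: deny-dev-to-prod        # hypothetical policy name
      namespace: prod               # applies to all workloads in prod
    spec:
      action: DENY
      rules:
        - from:
            - source:
                namespaces: ["dev"] # deny requests originating in dev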

The example shown below illustrates a Kubernetes cluster with an app composed of three services: a frontend, a backend, and a database.

The blue arrows represent the traffic that comes into your cluster through an ingress gateway. This ingress gateway can be anything from NGINX to a cloud-based load balancer like AWS ELB.
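
If you run Istio's ingress gateway, for instance, that entry point is configured declaratively with a Gateway resource; a minimal sketch with an illustrative hostname:

    apiVersion: networking.istio.io/v1beta1
    kind: Gateway
    metadata:
      name: frontend-gateway      # hypothetical name
    spec:
      selector:
        istio: ingressgateway     # bind to Istio's default ingress deployment
      servers:
        - port:
            number: 80
            name: http
            protocol: HTTP
          hosts:
            - "app.example.com"   # illustrative hostname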

In this case, we’re also using an egress gateway, which you should use in your cluster for better security. The red arrows indicate east-west traffic, i.e. the traffic that flows between your services.

[Figure: service-mesh-traffic.png]

Problems solved by service mesh

If you are not using a service mesh and you’re implementing a plain vanilla Kubernetes cluster instead, these are the problems you’ll run into:

No security between services

By default, there is no security between services. Even if you are using Weave Net, not everything is encrypted at the network layer: Weave Net encrypts traffic between nodes, but if you have two services running on the same node, traffic between them will not be encrypted.

To mitigate this risk, you could of course use TLS certificates for communication between all of your services. But doing so means more work for your SRE team, since they will need to rotate and manage TLS certificates. Your dev team will also need to integrate TLS into each service, among many other tasks. In the end, it’s no small feat to implement TLS across your cluster.

[Figure: service-mesh-traffic-overview.png]

Most service meshes have a goal of end-to-end encryption, which can save time for your teams. A service mesh injects a sidecar with a TLS certificate into each pod, and the control plane comes with a certificate authority that rotates the certificates for you.
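
With Istio, for example, enforcing mutual TLS mesh-wide is a single resource; the sidecars and the built-in certificate authority then handle issuance and rotation. A minimal sketch:

    apiVersion: security.istio.io/v1beta1
    kind: PeerAuthentication
    metadata:
      name: default
      namespace: istio-system    # the root namespace makes this mesh-wide
    spec:
      mtls:
        mode: STRICT             # reject plain-text service-to-service traffic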

Tracing a service latency problem is difficult

Another problem with a plain vanilla cluster is that it can be difficult to troubleshoot the source of a problem. For example, a latency issue may be particularly difficult to trace by looking only at the data from a single service. Any analytics data you are reading in this case may have nothing to do with the service that communicates with the outside world. The problem may instead reside in a malformed database query, or it could be a problem in the frontend; you don’t know for sure.

Without a service mesh, you can solve this kind of problem by instrumenting your code and then measuring your requests between each service in your application. But if you use a service mesh like Istio that has distributed tracing built right in, you won’t have to worry about the extra step of code instrumentation.

With a service mesh, all ingress and egress traffic is routed through a proxy sidecar, which adds tracing headers to each request. When a request comes through the ingress gateway to the frontend and on to the backend, you get a trace for all of those requests without having to instrument your code.
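
As a sketch, with Istio the trace sampling rate and the tracing backend are set in the mesh configuration; the values below are illustrative (the Zipkin address matches a typical in-cluster deployment):

    # Excerpt of Istio's mesh configuration (the "mesh" key of the
    # istio ConfigMap in istio-system); values are illustrative.
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 10.0           # sample 10% of requests
        zipkin:
          address: zipkin.istio-system:9411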

Load balancing is limited

Because metrics are built into a service mesh, you can take advantage of more advanced load-balancing strategies. For example, the frontend can be scaled up when it receives more traffic, and you can pinpoint other traffic bottlenecks more easily.
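
With Istio, for instance, the load-balancing strategy is chosen per destination service through a DestinationRule; a minimal sketch (the service name is illustrative):

    apiVersion: networking.istio.io/v1beta1
    kind: DestinationRule
    metadata:
      name: frontend-lb          # hypothetical name
    spec:
      host: frontend             # illustrative service
      trafficPolicy:
        loadBalancer:
          simple: LEAST_CONN     # maps to Envoy's least-request strategy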

What does a service mesh provide?

Not all of the service meshes out there have all of these capabilities, but in general, these are the features you gain:

  1. Service Discovery (eventually consistent, distributed cache)
  2. Load Balancing (least request, consistent hashing, zone/latency aware)
  3. Communication Resiliency (retries, timeouts, circuit-breaking, rate limiting)
  4. Security (end-to-end encryption, authorization policies)
  5. Observability (Layer 7 metrics, tracing, alerting)
  6. Routing Control (traffic shifting and mirroring; see the sketch after this list)
  7. API (programmable interface, Kubernetes Custom Resource Definitions (CRD))
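
As a sketch of routing control in Istio, a VirtualService can shift a weighted share of traffic to a new version of a service; names are illustrative, and the v1/v2 subsets would be defined in a companion DestinationRule:

    apiVersion: networking.istio.io/v1beta1
    kind: VirtualService
    metadata:
      name: backend-shift        # hypothetical name
    spec:
      hosts:
        - backend                # illustrative service
      http:
        - route:
            - destination:
                host: backend
                subset: v1
              weight: 90         # 90% of traffic stays on v1
            - destination:
                host: backend
                subset: v2
              weight: 10         # shift 10% to the new version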

Service Mesh Options for Kubernetes

There are three leading contenders in the Kubernetes ecosystem for service mesh, all of them open source. Each solution has its own benefits and drawbacks, but using any of them will put your DevOps teams in a better position to thrive as they develop and maintain more and more microservices.


Consul Connect

Consul is a full-featured service management framework, and the addition of Connect in v1.2 gave it secure service-to-service communication capabilities that make it a full service mesh. Consul is part of HashiCorp’s suite of infrastructure management products; it started as a way to manage services running on Nomad and has grown to support multiple other data center and container management platforms, including Kubernetes.

Consul Connect uses an agent installed on every node as a DaemonSet, which communicates with the Envoy sidecar proxies that handle routing and forwarding of traffic.
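
In practice, a workload typically opts into Connect through a Pod annotation that triggers sidecar injection; a minimal sketch, assuming the Consul Kubernetes integration is installed (Deployment details are illustrative):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: backend              # hypothetical workload
    spec:
      selector:
        matchLabels:
          app: backend
      template:
        metadata:
          labels:
            app: backend
          annotations:
            # Ask the Consul injector to add a Connect sidecar proxy.
            "consul.hashicorp.com/connect-inject": "true"
        spec:
          containers:
            - name: backend
              image: backend:1.0  # illustrative image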

Architecture diagrams and more product information are available at Consul.io.


Istio

Istio is a Kubernetes-native solution that was initially developed by Google, IBM, and Lyft, and a large number of major technology companies have chosen to back it as their service mesh of choice. Google, IBM, and Microsoft rely on Istio as the default service mesh offered in their respective Kubernetes cloud services.

Istio was the first to include additional features that developers really wanted, like deep-dive analytics.

Istio has separated its data and control planes by using a sidecar-loaded proxy that caches information so that it does not need to go back to the control plane for every call. The control plane consists of pods that also run in the Kubernetes cluster, allowing for better resilience in the event that a single pod fails in any part of the service mesh.

Architecture diagrams and more product information are available at Istio.io.

Linkerd

Linkerd is arguably the second most popular service mesh on Kubernetes and, since its rewrite in v2, its architecture closely mirrors Istio’s, with an initial focus on simplicity instead of flexibility. This fact, along with it being a Kubernetes-only solution, means fewer moving pieces and less complexity overall. While Linkerd v1.x is still supported, and it supports more container platforms than Kubernetes, new features (like blue/green deployments) are focused primarily on v2.

Linkerd is unique in that it is part of the Cloud Native Computing Foundation (CNCF), the organization behind Kubernetes. No other service mesh is backed by an independent foundation.

Architecture diagrams and more product information are available at Linkerd.io.

Comparison of Kubernetes service mesh technologies


| Feature | Istio | Linkerd v2 | Consul |
| --- | --- | --- | --- |
| Supported workloads: does it support both VM-based applications and Kubernetes? | | | |
| Workloads | Kubernetes + VMs | Kubernetes only | Kubernetes + VMs |
| Architecture: the solution's architecture has implications for operational overhead. | | | |
| Single point of failure | No (uses a sidecar per pod) | No | No, but managing HA of the Consul server and its quorum adds complexity vs. native K8s primitives |
| Sidecar proxy | Yes (Envoy) | Yes | Yes (Envoy) |
| Per-node agent | No | No | Yes |
| Secure communication: all support mutual TLS (mTLS) and native certificate management, so you can rotate certificates or revoke them if they are compromised. | | | |
| mTLS | Yes | Yes | Yes |
| Certificate management | Yes | Yes | Yes |
| Authentication and authorization | Yes | Yes | Yes |
| Communication protocols | | | |
| TCP | Yes | Yes | Yes |
| HTTP/1.x | Yes | Yes | Yes |
| HTTP/2 | Yes | Yes | Yes |
| gRPC | Yes | Yes | Yes |
| Traffic management | | | |
| Blue/green deployments | Yes | Yes | Yes |
| Circuit breaking | Yes | No | Yes |
| Fault injection | Yes | Yes | Yes |
| Rate limiting | Yes | No | Yes |
| Chaos Monkey-style testing: introduce delays or failures into some requests to improve resiliency and harden operations. | | | |
| Testing | Yes: delay or outright fail a configurable percentage of requests | Limited | No |
| Observability: distributed monitoring and tracing for identifying and troubleshooting incidents. | | | |
| Monitoring | Yes, with Prometheus | Yes, with Prometheus | Yes, with Prometheus |
| Distributed tracing | Yes | Some | Yes |
| Multicluster support | Yes | No | Yes |
| Installation | Helm and Operator | Helm | Helm |
| Operations complexity (install, configure, operate) | High | Low | Medium |

Any of these service meshes will solve your basic needs. The choice comes down to whether you want more than the basics.

Istio has the most features and flexibility of any of these three service meshes by far, but remember that flexibility means complexity, so your team needs to be ready for that.

For a minimalistic approach supporting just Kubernetes, Linkerd may be the best choice. If you want to support a heterogeneous environment that includes both Kubernetes and VMs and do not need the complexity of Istio, then Consul would probably be your best bet.

Istio

Istio has a Go control plane and uses Envoy as its proxy data plane. It is a complex system that does many things, like tracing, logging, TLS, authentication, etc. A drawback is the resource-hungry control plane, says Stefan: the more services you have, the more resources you need to run Istio.

AWS App Mesh

This is a managed control plane that also uses an Envoy proxy for its data plane, so you don’t have to run it yourself on your cluster. It works very similarly to Istio, but since it’s fairly new, it still lacks many of the features that Istio has; for example, it doesn’t include mTLS or traffic policies.

Linkerd v2

Linkerd v2 also has a Go control plane, with a data plane built on a Linkerd proxy written in Rust. Linkerd has some distributed tracing capabilities and recently implemented traffic shifting. The current 2.4 release implements the Service Mesh Interface (SMI) traffic split API (sketched below), which makes it possible to automate canary deployments and other progressive delivery strategies with Linkerd and Flagger. The Linkerd roadmap also shows that many other new features will be implemented over the next year.
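
A sketch of what that SMI traffic split looks like; Flagger adjusts the weights automatically during a canary run (service names are illustrative):

    apiVersion: split.smi-spec.io/v1alpha1
    kind: TrafficSplit
    metadata:
      name: backend-split        # hypothetical name
    spec:
      service: backend           # the apex service that clients call
      backends:
        - service: backend-primary
          weight: 900m           # roughly 90% of traffic
        - service: backend-canary
          weight: 100m           # roughly 10% of traffic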

Consul Connect

Consul Connect uses a Consul control plane and requires the data plane to be managed inside an app. It does not implement Layer 7 traffic management, nor does it support Kubernetes CRDs.

How does progressive delivery work with a service mesh?

Progressive delivery is Continuous Delivery with fine-grained control over the blast radius. This means that you can deliver new features of your app to a certain percentage of your user base.

In order to control the progressive deployments, you need the following:

  • User segmentation (provided by the service mesh)
  • Traffic shifting management (provided by the service mesh)
  • Observability and metrics (provided by the service mesh)
  • Automation (a service mesh add-on like Flagger)

Canary

A canary deployment is used when you want to test new functionality, typically on the backend of your application. Traditionally you may have had two almost identical environments: one that serves all users and another with the new features that gets rolled out to a subset of users and compared against the first. When no errors are reported, the new version can gradually roll out to the rest of the infrastructure.

[Figure: canary-deployment.png]

While this strategy can be implemented using plain Kubernetes resources by replacing old and new pods, it is much more convenient and easier to implement with a service mesh like Istio and an add-on like Flagger, which can automate the shift in traffic, as sketched below.
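
As a sketch, a Flagger canary definition on Istio looks roughly like this; Flagger shifts traffic step by step and rolls back automatically if the metrics degrade (names and thresholds are illustrative):

    apiVersion: flagger.app/v1beta1
    kind: Canary
    metadata:
      name: backend              # hypothetical name
    spec:
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: backend            # the workload to promote gradually
      service:
        port: 8080               # illustrative port
      analysis:
        interval: 1m             # run checks every minute
        threshold: 5             # roll back after 5 failed checks
        maxWeight: 50            # cap canary traffic at 50%
        stepWeight: 10           # shift traffic in 10% increments
        metrics:
          - name: request-success-rate
            thresholdRange:
              min: 99            # require at least a 99% success rate
            interval: 1m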

Yiadh TLIJANI

Founder of The DevOps Bootcamp | Senior DevSecOps Engineer | Certified Kubernetes Developer, Administrator and Security Specialist