Effortless Multi-cloud Management: Secure Access to VMs and Kubernetes Workloads
SSH (Secure Shell) is one of the most ubiquitous tools in the modern computing world. It has been around for a long time and is commonly used on all Linux machines. While it works really well, there are some operational burdens and security concerns that come with it, such as -
Opening SSH ports to the target VM (Virtual Machine).
SSH key management, including keeping track of who has access to which VMs.
What commands are people running on the VMs when they access them?
How do you protect against the dangers of SSH key theft?
Overview
As SSH has been around for a long time, there are some common tools and concepts that can somewhat deal with these concerns, but they can be very complicated and a huge overhead to manage. Some of the key considerations are -
The SSH port on a target VM can be kept secure without opening it to the internet by using bastion hosts (jump servers), VPNs (Virtual Private Networks), or a combination of the two. While this reduces the number of open ports, we still need to open some ports for the VPN or the bastion host itself, and any vulnerability in the VPN server or the bastion host can still end up exposing the resources.
There are also tools like Teleport or Netflix's BLESS that can handle some of these concerns. Many companies use SSH certificates, which tools like Netflix's BLESS make easier to adopt. These tools can also protect us from SSH key theft, but they are complicated to manage.
Linux utilities like "script" can log what is happening on the VM when you log in over SSH.
To mitigate the concerns and management overhead that come with SSH and the tools in its ecosystem, we looked for modern, cloud-native solutions that would give us easy, secure and auditable access. Some of the requirements and good-to-haves we had are listed below.
Accessing VMs without opening ports within our cloud networks
Strong auditing features including session audit log file management.
Compatibility with config management tools like Ansible.
An easy way to segregate our VMs and add contextual information based on the Yellow.ai region they belong to.
Support for accessing VMs in on-premises data centers the same way we connect to cloud-based VMs.
Solution
After evaluating various options, the Infra Team at Yellow.ai ended up going with AWS SSM (AWS Systems Manager), which has a lot of the features we wanted.
As the SSM agent creates an outbound connection to the AWS Systems Manager service and receives commands over it, no inbound ports have to be opened for us to connect to the VMs; only outgoing traffic has to be allowed.
Systems Manager also supports running SSM agents on VMs in other public clouds and in private data centers. This ensures we can use the same steps to access a VM irrespective of where it is.
As all our traffic goes through SSM, which AWS keeps secure, we do not have the overhead of maintaining a centralized server that supports the features we need.
The SSM agent logs all the commands that are run on the managed instances and ships them to S3 (Amazon Simple Storage Service). This makes strong auditing possible while also simplifying log management.
We can centrally enforce a maximum duration for any session and have sessions close automatically if no traffic is being sent through them.
SSM also supports custom script execution using SSM Documents. This lets us extend it for more complex workflows and operations, such as running standardized scripts to set up a VM, patch OS vulnerabilities and ensure that all compliance tasks are executed automatically.
Segregation of VMs under different regions/deployments is solved by using tags, which are the native way to do it in AWS (Amazon Web Services); a sketch of this tag-based targeting and S3 logging is shown right after this list.
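To make this concrete, here is a minimal sketch (Python with boto3, not CAP's actual code) of running a shell command on instances selected by tag, with the output shipped to S3 for auditing. The tag key, tag value, bucket name and region below are illustrative assumptions.

```python
# Minimal sketch: run a command on tagged instances via SSM Run Command and
# ship the output to S3. Tag key/value, bucket and region are assumptions.
import boto3

ssm = boto3.client("ssm", region_name="ap-south-1")

response = ssm.send_command(
    # Target instances by tag instead of hard-coding instance IDs.
    Targets=[{"Key": "tag:yellowai-region", "Values": ["india-prod"]}],
    # AWS-RunShellScript is a managed SSM Document that runs shell commands.
    DocumentName="AWS-RunShellScript",
    Parameters={"commands": ["uptime", "df -h /"]},
    # Ship the full command output to S3 so it can be audited later.
    OutputS3BucketName="example-ssm-audit-logs",
    OutputS3KeyPrefix="run-command/",
    Comment="health check triggered through CAP",
)
print(response["Command"]["CommandId"])
```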
To make it easier to use SSM while ensuring relevant information is retained for auditing purposes, we decided to build CAP (Central Access Platform), a contextual abstraction over AWS Systems Manager's Session Manager. CAP gives us programmatic access to execute commands and create browser-based interactive sessions to VMs and Kubernetes clusters. It adds metadata to the managed instances so we know which deployment/environment a VM/cluster belongs to, and it creates SSM sessions with user-related metadata when an engineer wants to access a VM. Admins and Infosec teams can fetch the logs for any session. It also has RBAC (Role Based Access Control), which ensures that only Admins can run commands that require sudo access. This lets us change the permissions an engineer has very easily while also helping us with our security and compliance needs.
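As a rough illustration of the kind of call CAP makes under the hood when an engineer requests access, the sketch below starts a Session Manager session against a managed instance and attaches user metadata via the Reason field. The instance ID, user details and region are assumptions for the example, not CAP's real implementation.

```python
# Minimal sketch: open a Session Manager session with audit metadata attached.
# Instance ID, user details and region are illustrative assumptions.
import boto3

ssm = boto3.client("ssm", region_name="ap-south-1")

session = ssm.start_session(
    Target="i-0123456789abcdef0",  # the managed instance picked in CAP
    Reason="cap user=jane.doe team=oncall ticket=INC-1234",  # audit metadata
)

# The returned SessionId is what later shows up in the session audit logs;
# the StreamUrl/TokenValue are handed to the terminal client to open the shell.
print(session["SessionId"], session["StreamUrl"])
```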
There are two ways an engineer can access a VM -
The first is a browser-based direct shell that places an engineer inside the VM. It is almost instantaneous and great for when an on-call engineer needs to perform some actions. The functionality is limited to shell-based commands only and cannot be used to access the applications or services running on the VM directly over a TCP/UDP port.
The second is a browser-based Ubuntu desktop box we call CAB (Central Access Box), which can access the VMs. A CAB is useful when an engineer wants to perform maintenance or operations on multiple VMs that involve additional tooling. A CAB comes with various built-in tools like TablePlus, MongoDB Compass, Chrome, Firefox and infractl. infractl is a CLI tool that can create sessions and port forwards for the user.
An example use case for a CAB is performing database migrations on DB nodes. The engineer can use infractl to port forward to the services on the VM from the CAB and then work on them with the built-in tools; a sketch of this kind of port-forwarding call is shown below.
A CAB is allowed to access VMs only from the region it was launched for. This also ensures a user cannot accidentally copy data from a VM in one Yellow.ai region to another Yellow.ai region.
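For illustration, here is a minimal sketch of the kind of port-forwarding session infractl could open from a CAB using the managed AWS-StartPortForwardingSession document. The instance ID, ports and region are assumptions, not infractl's actual code.

```python
# Minimal sketch: forward a remote service port (e.g. MongoDB on 27017) to a
# local port inside the CAB via Session Manager. IDs and ports are assumptions.
import boto3

ssm = boto3.client("ssm", region_name="ap-south-1")

session = ssm.start_session(
    Target="i-0123456789abcdef0",
    DocumentName="AWS-StartPortForwardingSession",
    Parameters={
        "portNumber": ["27017"],       # port of the service on the remote VM
        "localPortNumber": ["27017"],  # port exposed locally inside the CAB
    },
)

# The session-manager-plugin then uses the returned StreamUrl and TokenValue
# to establish the actual tunnel; tools like MongoDB Compass connect locally.
print(session["SessionId"])
```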
To ensure we had a similar experience for Kubernetes, we took inspiration from the EKS (Amazon Elastic Kubernetes Service) connector and adapted SSM to work on our clusters. The k8s connector is a pod running on the cluster which, together with CAP, handles creating a user and assigning the permissions they require based on RBAC. We are now able to connect to our clusters the same way we connect to the VMs. kubectl is built into this pod and commands can be run as normal. We get the same session logging features here, which are easier to deal with than the Kubernetes event logs. A positive side effect of this method of connection is that API calls to the cluster control plane are much faster than using kubectl locally.
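As a heavily simplified illustration of this pattern (not the connector's actual code): once the connector pod is registered as an SSM managed instance, a kubectl command can be routed to it the same way a command is sent to a VM. The managed-instance ID, namespace and region below are assumptions.

```python
# Minimal sketch: send a kubectl command to the in-cluster connector, which is
# registered as an SSM managed instance ("mi-..."). ID/namespace are assumptions.
import boto3

ssm = boto3.client("ssm", region_name="ap-south-1")

response = ssm.send_command(
    InstanceIds=["mi-0123456789abcdef0"],  # the connector pod's managed-instance ID
    DocumentName="AWS-RunShellScript",
    Parameters={"commands": ["kubectl get pods -n example-namespace"]},
    Comment="cluster inspection via CAP",
)
print(response["Command"]["CommandId"])
```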
As part of provisioning, the Infra Team onboards a resource on CAP. When an engineer needs to access a resource, they can select the resource they want and connect. A common workflow used for debugging is shown below.
CAP has been in place for nearly two years now and has helped us many times during incidents and audits.
What’s next?
As all these functionalities are consumed through APIs, we have been able to extend them to build an Infra Platform that offers other functionality used by various teams at Yellow.ai. We rely on CAP and SSM APIs to patch VMs, run workflows, and check the current state of a workload on a Kubernetes cluster. We will cover more of these in another blog.