Node Reboots in AKS using Kured (and How to SSH into a Node!)
Welcome back to my Azure Kubernetes Service (AKS) learning journey! Today, I tackled a very practical operational challenge: Node Maintenance.
From a security and maintainability standpoint, keeping your Kubernetes worker nodes updated is a firm best practice. The problem? Many OS-level updates require the virtual machine to reboot, and you definitely do not want your VMs rebooting at random in the middle of the day while serving live customer traffic!
To solve this, I learned how to use Kured (the KUbernetes REboot Daemon). It allows you to schedule exactly when your nodes are allowed to reboot, and handles the reboot safely.
Here is a breakdown of how I set it up and tested it.
What is Kured?
Kured runs as a DaemonSet in your cluster. This means Kubernetes ensures that exactly one Kured pod runs on every single worker node.
These pods constantly monitor their host VMs. Specifically, they look for a file located at /var/run/reboot-required (which Linux creates automatically when an update requires a restart). If Kured sees this file, it checks the schedule we gave it. If we are within the allowed time window, Kured will safely "cordon" the node (stop new pods from scheduling there), "drain" it (move existing pods away), and then reboot the VM.
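The core check is simple enough to sketch in a few lines of shell. This is a hypothetical illustration of the logic, not Kured's actual implementation (Kured is written in Go), and the `check_reboot` helper name is my own; the sentinel path is parameterized so you can try it without touching /var/run.

```shell
#!/bin/sh
# Hypothetical sketch of the check each Kured pod performs on its node.
# check_reboot prints whether the reboot sentinel file exists; pass an
# alternative path as the first argument to test it locally.
check_reboot() {
    sentinel="${1:-/var/run/reboot-required}"
    if [ -f "$sentinel" ]; then
        # Inside the allowed window, Kured would now cordon and drain
        # the node, reboot the host, and uncordon it after it comes back.
        echo "reboot required"
    else
        echo "reboot not required"
    fi
}
```

On a freshly patched node that needs a restart, this check would report that a reboot is required; Kured then waits for the configured time window before acting on it.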
Step 1: Installing Kured via Helm
To install Kured, I used Helm. I configured it to only run on Linux nodes and set up a very specific schedule: reboots are only allowed between 9 AM and 11 PM, Monday through Friday (because no one wants to troubleshoot a broken node on a Sunday!).
For this demo, I also set it to check for the reboot file every 1 minute (0h1m0s).
# Add the Kured Helm repository
helm repo add kured https://weaveworks.github.io/kured
# Install Kured with our custom scheduling configuration
helm upgrade kured kured/kured \
--namespace kured \
--install \
--create-namespace \
--set nodeSelector."kubernetes\.io/os"=linux \
--set configuration.startTime=9am \
--set configuration.endTime=11pm \
--set configuration.period=0h1m0s \
--set configuration.timeZone="America/Los_Angeles" \
--set configuration.rebootDays="[mo,tu,we,th,fr]"
Once installed, I verified the DaemonSet pods were running:
# Check the pods (you should see one for every node)
kubectl get pods -n kured -o wide
# Check the logs of one of the pods to see its configuration
kubectl logs kured-lntr9 -n kured
In the logs, Kured reported Reboot is not required at the moment because my VM was perfectly happy. It was time to force a reboot to see if Kured actually works!
Step 2: The Hard Part - SSHing into an AKS Node
To force a reboot, I needed to log into the actual worker node VM and manually create that /var/run/reboot-required file.
But wait... how do you SSH into an AKS node? Azure manages these VMs! It turns out, it's a multi-step process. You have to push your SSH keys to the underlying Virtual Machine Scale Set (VMSS), and then use a temporary "jumpbox" pod inside the cluster to connect to the node's internal IP.
Here is the exact workflow I used via the Azure Cloud Shell (https://shell.azure.com):
A. Push SSH Keys to the VMSS
First, I had to find the hidden "Infrastructure Resource Group" where Azure actually stores the VMs, and then update the VMSS with my public SSH key.
# Find the hidden infrastructure ("node") resource group for the cluster
CLUSTER_RESOURCE_GROUP=$(az aks show --resource-group aks-103-rg --name myClusterName --query nodeResourceGroup -o tsv)
# Get the Virtual Machine Scale Set (VMSS) name
SCALE_SET_NAME=$(az vmss list --resource-group $CLUSTER_RESOURCE_GROUP --query [0].name -o tsv)
# Add the SSH extension to the VMSS using my local public key
az vmss extension set \
--resource-group $CLUSTER_RESOURCE_GROUP \
--vmss-name $SCALE_SET_NAME \
--name VMAccessForLinux \
--publisher Microsoft.OSTCExtensions \
--version 1.4 \
--protected-settings "{\"username\":\"azureuser\",\"ssh_key\":\"$(cat ~/.ssh/id_rsa.pub)\"}"
# Apply the update to all instances in the scale set
az vmss update-instances --instance-ids '*' \
--resource-group $CLUSTER_RESOURCE_GROUP \
--name $SCALE_SET_NAME
B. Create a Jumpbox Pod
Since the nodes only have internal IPs, I spun up a temporary Debian pod inside the cluster to act as my bridge.
# Run an interactive Debian pod
kubectl run -it --rm aks-ssh --image=debian
# Inside the Debian pod, install the SSH client
apt-get update && apt-get install openssh-client -y
C. Copy Keys and Connect
From a new terminal window on my local machine, I copied my private key into that running Debian pod.
# Copy the private key to the pod
kubectl cp ~/.ssh/id_rsa $(kubectl get pod -l run=aks-ssh -o jsonpath='{.items[0].metadata.name}'):/id_rsa
Finally, I went back to my Debian pod terminal, secured the key, found my node's internal IP (kubectl get nodes -o wide), and connected!
# Inside the Debian pod: Fix permissions and SSH into the worker node
chmod 0600 id_rsa
ssh -i id_rsa azureuser@10.240.0.4
Success! I was greeted with the azureuser@aks-default...:~$ prompt. I was officially inside my AKS worker node.
Step 3: Triggering Kured
Now that I was inside the VM, I tried running sudo apt-get update && sudo apt-get upgrade -y. However, the updates I pulled didn't actually require a reboot.
To trick Kured, I manually created the trigger file. The file doesn't need any data inside it; its mere existence is enough.
# Create the reboot trigger file
sudo touch /var/run/reboot-required
I immediately exited the SSH session and watched my cluster nodes:
kubectl get nodes -w
Almost instantly, the Kured pod on that node noticed the file. In my terminal window, I watched the node's status change to:
aks-default-123-vmss000   Ready,SchedulingDisabled
Kured had successfully cordoned the node to protect my workloads and initiated the reboot! Once the VM finished restarting, Kured automatically uncordoned it, bringing it back to a normal Ready state.
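If you ever need to reproduce by hand what Kured just did, the maintenance cycle maps onto three kubectl commands. The helper below simply prints the sequence as a dry-run sketch (`maintenance_cmds` is my own name, and the drain flags are typical choices, not necessarily Kured's exact invocation):

```shell
#!/bin/sh
# maintenance_cmds: print the kubectl sequence for a manual node reboot.
# Pass the node name as the first argument.
maintenance_cmds() {
    node="$1"
    echo "kubectl cordon $node"      # stop new pods from scheduling here
    echo "kubectl drain $node --ignore-daemonsets --delete-emptydir-data"  # evict running pods
    echo "kubectl uncordon $node"    # after the reboot, readmit pods
}

maintenance_cmds aks-default-123-vmss000
```

Running the printed commands against a live cluster (with the reboot in between the drain and the uncordon) is effectively what Kured automates inside your maintenance window.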
Key Takeaways
- Automate Reboots: Never leave node reboots to chance. Use a tool like Kured to define strict, safe maintenance windows.
- How it works: Kured watches for /var/run/reboot-required, cordons/drains the node, reboots it, and then uncordons it.
- SSH into AKS: It requires updating the VMSS with your public key and using an internal pod as a jumpbox. It's tedious, but incredibly useful for deep troubleshooting!
