Demystifying AKS Control Plane Logs
Welcome back to my Azure Kubernetes Service (AKS) learning journey! Over the past couple of weeks, we’ve built clusters, explored networking, and planned out IP addresses. But today, I had to ask myself a scary question: What happens when something inside the cluster breaks?
If a pod won't start, a node won't scale, or someone accidentally deletes a critical application, you need to know exactly what happened. Because Azure manages the Kubernetes "Control Plane" for us, we can't just SSH into the master nodes to look around. Instead, we have to rely on AKS Resource Logs (often called diagnostic logs).
Today, I learned that there are 7 major types of control plane logs in AKS. I’m documenting them all here, plus a super handy mental model and troubleshooting guide I put together to make sense of it all.
7 Types of AKS Logs
1️⃣ kube-apiserver
- What it logs: Requests made to the Kubernetes API server. The API server is the main entry point for the cluster—every single operation goes through it.
- Logged events include: `kubectl` commands, API calls from applications, resource creation/updates/deletions, and authentication/authorization checks.
- Example activity: `kubectl create deployment` or `kubectl get pods`.
- Use it for: Debugging failed API requests, checking who created or modified resources, and investigating API errors.
2️⃣ kube-audit
- What it logs: Detailed security audit logs of Kubernetes API activity. It meticulously records who did what and when.
- Example activity: A user created a pod, a service account deleted a deployment, or RBAC permission checks.
- Use it for: Security monitoring, strict compliance tracking, and deep incident investigation. (Warning: This logs everything, including read operations, so it can get very noisy and expensive!)
3️⃣ kube-audit-admin
- What it logs: A smarter, filtered subset of the audit logs. It drops the `get` and `list` events, so what remains are the operations that actually change cluster state, which covers most administrative actions.
- Example activity: Creating or deleting namespaces, changing RBAC roles, or altering cluster-level configurations.
- Use it for: Admin activity tracking, security audits, and privileged access monitoring without the massive data volume of the standard `kube-audit`.
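To make the difference between the two audit categories concrete, here is a minimal Python sketch. It assumes that `kube-audit-admin` is simply `kube-audit` with the `get` and `list` events removed. The sample events are invented for illustration, and only the `verb`, `user.username`, and `objectRef.resource` fields of a Kubernetes audit event are modeled.

```python
# Toy model: kube-audit-admin treated as kube-audit minus read-only verbs.
# The events below are invented samples, not real cluster output.
READ_ONLY_VERBS = {"get", "list"}


def admin_subset(events):
    """Keep only the events whose verb is not get or list."""
    return [e for e in events if e["verb"] not in READ_ONLY_VERBS]


sample_events = [
    {"verb": "list", "user": {"username": "alice"}, "objectRef": {"resource": "pods"}},
    {"verb": "create", "user": {"username": "alice"}, "objectRef": {"resource": "pods"}},
    {"verb": "get", "user": {"username": "bob"}, "objectRef": {"resource": "secrets"}},
    {"verb": "delete", "user": {"username": "ci-deployer"}, "objectRef": {"resource": "deployments"}},
]

for event in admin_subset(sample_events):
    print(event["user"]["username"], event["verb"], event["objectRef"]["resource"])
```

Only the `create` and `delete` events survive the filter, which is why this category tends to be so much smaller than the full audit stream.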
4️⃣ kube-controller-manager
- What it logs: Logs from the controller manager component. Controllers are the loops that constantly ensure your desired state equals your actual state (e.g., Deployment, ReplicaSet, Node, and Job controllers).
- Example activity: `Desired replicas: 3 | Current replicas: 1 | Action: create 2 pods`
- Use it for: Investigating pods that are not scaling properly, ReplicaSet issues, and general Deployment problems.
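The "desired state versus actual state" loop is easier to picture with a tiny Python sketch. This is a toy model only: the `reconcile` function and its return strings are hypothetical and do not mirror the real controller code or its log format.

```python
def reconcile(desired, current):
    """Return the action a replica controller would take (toy model)."""
    diff = desired - current
    if diff > 0:
        return f"create {diff} pods"
    if diff < 0:
        return f"delete {-diff} pods"
    return "no action"


# Same numbers as the example activity: 3 desired, 1 running.
print(reconcile(desired=3, current=1))
```

A real controller repeats this comparison continuously, so the cluster keeps drifting back toward the declared state.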
5️⃣ kube-scheduler
- What it logs: Logs from the scheduler, the component responsible for assigning pods to nodes.
- How it works: It evaluates CPU, memory, node labels, taints and tolerations, and affinity rules to answer the question: Which node should run this pod?
- Example activity: `Pod scheduled to node aks-nodepool1-123`
- Use it for: Troubleshooting pods stuck in a `Pending` state, scheduling failures, and resource constraints.
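As a rough illustration of why a pod ends up `Pending`, the sketch below filters nodes on free CPU, free memory, and taints. It is a heavy simplification under my own assumptions: the field names (`free_cpu_m`, `free_mem_mi`, and so on) are made up, taints are plain strings, and labels, affinity rules, and scoring are ignored entirely.

```python
def feasible_nodes(pod, nodes):
    """Return the names of nodes that pass the simplified filter checks."""
    fits = []
    for node in nodes:
        enough_cpu = node["free_cpu_m"] >= pod["cpu_m"]
        enough_mem = node["free_mem_mi"] >= pod["mem_mi"]
        # Every taint on the node must be tolerated by the pod.
        tolerated = node["taints"] <= pod["tolerations"]
        if enough_cpu and enough_mem and tolerated:
            fits.append(node["name"])
    return fits


nodes = [
    {"name": "node-a", "free_cpu_m": 500, "free_mem_mi": 1024, "taints": set()},
    {"name": "node-b", "free_cpu_m": 4000, "free_mem_mi": 8192,
     "taints": {"gpu=true:NoSchedule"}},
]
pod = {"cpu_m": 1000, "mem_mi": 2048, "tolerations": set()}

candidates = feasible_nodes(pod, nodes)
print(candidates or "no feasible node: pod stays Pending")
```

Here `node-a` is too small and `node-b` carries a taint the pod does not tolerate, so no node qualifies. That is the kind of situation the scheduler logs help explain.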
6️⃣ cluster-autoscaler
- What it logs: Logs from the cluster autoscaler engine, which automatically adds or removes nodes based on your workload's demands.
- Example activity: `Scale up nodepool from 3 to 5 nodes` (because there are too many pending pods) or scaling down due to low utilization.
- Use it for: Debugging node scaling issues, optimizing costs, and figuring out why pending pods aren't triggering new nodes to spin up.
7️⃣ guard
- What it logs: Logs from AKS Guard, the security component that handles Microsoft Entra ID (Azure Active Directory) integration and authorization validation.
- Example activity: Validating user tokens, checking Azure AD permissions, or recording authentication failures.
- Use it for: Troubleshooting login failures, identity access issues, and Azure AD authentication problems.
🧠 Simple Mental Model
To keep all of this straight in my head, I created this quick reference table. Think of it like this:
| Component | Main Responsibility |
| --- | --- |
| API Server | The front door/entry point for all cluster requests. |
| Audit | The security tracker (who did what). |
| Controller Manager | Maintains the desired state (makes sure reality matches the config). |
| Scheduler | Places pods onto the right nodes. |
| Autoscaler | Adds or removes nodes (virtual machines) in the node pool. |
| Guard | Authentication and security gateway (Azure AD). |
🛠️ Quick Troubleshooting Guide
If something breaks, which log should you check first? Here is the cheat sheet I will be using from now on:
| Problem | Log to Check First |
| --- | --- |
| Pod is stuck in `Pending` state | kube-scheduler |
| Nodes aren't scaling up/down | cluster-autoscaler |
| Deployment isn't creating the right number of pods | kube-controller-manager |
| Investigating unauthorized API access | kube-audit (or kube-audit-admin) |
| `kubectl` commands are failing or timing out | kube-apiserver |
| You can't log in / AAD identity issues | guard |
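
If you prefer the cheat sheet in script form, the same mapping can be written as a small Python lookup. The symptom keys are informal labels I picked for this sketch. Only the log category names on the right come from the table above.

```python
# Symptom keys are informal labels chosen for this sketch; the values are
# the AKS log categories from the troubleshooting table.
FIRST_LOG_TO_CHECK = {
    "pod stuck in Pending": "kube-scheduler",
    "nodes not scaling": "cluster-autoscaler",
    "wrong replica count": "kube-controller-manager",
    "unauthorized API access": "kube-audit",
    "kubectl failing or timing out": "kube-apiserver",
    "login or identity failure": "guard",
}


def first_log(symptom):
    """Return the log category to check first, or None for unknown symptoms."""
    return FIRST_LOG_TO_CHECK.get(symptom)


print(first_log("pod stuck in Pending"))
```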
