Demystifying AKS Control Plane Logs

Welcome back to my Azure Kubernetes Service (AKS) learning journey! Over the past couple of weeks, we’ve built clusters, explored networking, and planned out IP addresses. But today, I had to ask myself a scary question: What happens when something inside the cluster breaks?

If a pod won't start, a node won't scale, or someone accidentally deletes a critical application, you need to know exactly what happened. Because Azure manages the Kubernetes control plane for us, we can't just SSH into the control plane (master) nodes to look around. Instead, we have to rely on AKS resource logs (often called diagnostic logs), which are only collected once you enable them in a diagnostic setting on the cluster.

Today, I learned that there are 7 major types of control plane logs in AKS. I’m documenting them all here, plus a super handy mental model and troubleshooting guide I put together to make sense of it all.



7 Types of AKS Logs

1️⃣ kube-apiserver

  • What it logs: Requests made to the Kubernetes API server. The API server is the main entry point for the cluster, so every kubectl command and every API call from a workload or controller passes through it.
  • Logged events include: kubectl commands, API calls from applications, resource creation/updates/deletions, and authentication/authorization checks.
  • Example activity: kubectl create deployment or kubectl get pods.
  • Use it for: Debugging failed API requests, checking who created or modified resources, and investigating API errors.


2️⃣ kube-audit

  • What it logs: Detailed security audit logs of Kubernetes API activity. It meticulously records who did what and when.
  • Example activity: A user created a pod, a service account deleted a deployment, or RBAC permission checks.
  • Use it for: Security monitoring, strict compliance tracking, and deep incident investigation. (Warning: This logs everything, including read operations, so it can get very noisy and expensive!)


3️⃣ kube-audit-admin

  • What it logs: A filtered subset of kube-audit. It drops the get and list (read) audit events and keeps the rest, such as create, update, and delete. The filter is based on the request verb, not on how privileged the caller is.
  • Example activity: Creating or deleting namespaces, changing RBAC roles, or updating a Deployment.
  • Use it for: Tracking who changed what, security audits, and change monitoring without the massive data volume of the full kube-audit.


4️⃣ kube-controller-manager

  • What it logs: Logs from the controller manager component. Controllers are the loops that constantly ensure your desired state equals your actual state (e.g., Deployment, ReplicaSet, Node, and Job controllers).
  • Example activity: Desired replicas: 3 | Current replicas: 1 | Action: create 2 pods
  • Use it for: Investigating pods that are not scaling properly, ReplicaSet issues, and general Deployment problems.
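The "desired vs. current" example above is the heart of every controller. Here is a minimal sketch of that comparison in Python. The function name `reconcile` and the action strings are my own illustration, not the real controller code, which watches the API server and handles far more cases.

```python
def reconcile(desired_replicas: int, current_replicas: int) -> str:
    """Return the action a replica controller would take (illustrative only)."""
    diff = desired_replicas - current_replicas
    if diff > 0:
        return f"create {diff} pods"
    if diff < 0:
        return f"delete {-diff} pods"
    return "no action"


# The example from the bullet above: desired 3, current 1.
print(reconcile(desired_replicas=3, current_replicas=1))
```

When a Deployment is stuck at the wrong replica count, the kube-controller-manager log is where you would look for why this loop could not close the gap.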


5️⃣ kube-scheduler

  • What it logs: Logs from the scheduler, the component responsible for assigning pods to nodes.
  • How it works: It evaluates CPU, memory, node labels, taints and tolerations, and affinity rules to answer the question: Which node should run this pod?
  • Example activity: Pod scheduled to node aks-nodepool1-123
  • Use it for: Troubleshooting pods stuck in a Pending state, scheduling failures, and resource constraints.
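To make the "which node should run this pod?" question concrete, here is a small sketch of the filtering step only. It assumes a heavily simplified model: the `Node` and `Pod` classes, their fields, and the node name `aks-gpu-1` are hypothetical, taints are reduced to plain strings, and affinity rules and scoring are left out entirely.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_millicores: int
    free_memory_mib: int
    taints: set[str] = field(default_factory=set)


@dataclass
class Pod:
    name: str
    cpu_millicores: int
    memory_mib: int
    tolerations: set[str] = field(default_factory=set)


def feasible_nodes(pod: Pod, nodes: list[Node]) -> list[str]:
    """Names of nodes with enough free CPU and memory whose taints the pod tolerates."""
    return [
        node.name
        for node in nodes
        if node.free_cpu_millicores >= pod.cpu_millicores
        and node.free_memory_mib >= pod.memory_mib
        and node.taints <= pod.tolerations  # every taint must be tolerated
    ]


nodes = [
    Node("aks-nodepool1-123", free_cpu_millicores=500, free_memory_mib=1024),
    Node("aks-gpu-1", free_cpu_millicores=4000, free_memory_mib=8192, taints={"gpu"}),
]

print(feasible_nodes(Pod("web", 250, 512), nodes))   # only the untainted node fits
print(feasible_nodes(Pod("big", 8000, 512), nodes))  # empty list: the pod stays Pending
```

An empty result is the simplified equivalent of a Pending pod: no node passed the filters, and the kube-scheduler log is where the reason shows up.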


6️⃣ cluster-autoscaler

  • What it logs: Logs from the cluster autoscaler engine, which automatically adds or removes nodes based on your workload's demands.
  • Example activity: Scale up nodepool from 3 to 5 nodes (because there are too many pending pods) or scaling down due to low utilization.
  • Use it for: Debugging node scaling issues, optimizing costs, and figuring out why pending pods aren't triggering new nodes to spin up.
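The scale-up example above can be sketched as simple arithmetic. This is a deliberately rough approximation and not how the real cluster autoscaler decides (it simulates scheduling for each pending pod). The function `target_node_count` and its `pods_per_node` parameter are assumptions I made up for illustration.

```python
import math


def target_node_count(
    current_nodes: int, pending_pods: int, pods_per_node: int, max_nodes: int
) -> int:
    """Rough scale-up estimate: add enough nodes for the pending pods, capped at max_nodes."""
    extra_nodes = math.ceil(pending_pods / pods_per_node)
    return min(current_nodes + extra_nodes, max_nodes)


# 20 pending pods at roughly 10 pods per node: scale from 3 to 5 nodes.
print(target_node_count(current_nodes=3, pending_pods=20, pods_per_node=10, max_nodes=10))

# Same pending pods, but the pool is already at its maximum: no new nodes.
print(target_node_count(current_nodes=5, pending_pods=20, pods_per_node=10, max_nodes=5))
```

The second call shows one common reason pending pods don't trigger new nodes: the node pool has already reached its configured maximum.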


7️⃣ guard

  • What it logs: Logs from AKS Guard, the security component that handles Microsoft Entra ID (formerly Azure Active Directory) integration and authorization validation.
  • Example activity: Validating user tokens, checking Microsoft Entra ID permissions, or recording authentication failures.
  • Use it for: Troubleshooting login failures, identity access issues, and Microsoft Entra ID authentication problems.
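Since each category is switched on or off individually, it helps to see the choice written down. The sketch below builds the list of `category`/`enabled` entries that an Azure Monitor diagnostic setting uses for logs. The helper `build_logs_setting` is hypothetical, and I'm assuming you would paste the resulting JSON into whatever tool you use to create the diagnostic setting (for example, the `--logs` argument of `az monitor diagnostic-settings create`).

```python
import json

# The seven categories covered in this post.
CONTROL_PLANE_CATEGORIES = [
    "kube-apiserver",
    "kube-audit",
    "kube-audit-admin",
    "kube-controller-manager",
    "kube-scheduler",
    "cluster-autoscaler",
    "guard",
]


def build_logs_setting(enabled: set[str]) -> str:
    """Return a JSON list marking each category as enabled or disabled."""
    unknown = enabled - set(CONTROL_PLANE_CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    entries = [
        {"category": category, "enabled": category in enabled}
        for category in CONTROL_PLANE_CATEGORIES
    ]
    return json.dumps(entries, indent=2)


# Keep costs down: collect kube-audit-admin instead of the full kube-audit.
print(build_logs_setting({"kube-apiserver", "kube-audit-admin", "guard"}))
```

Enabling kube-audit-admin while leaving kube-audit off is the trade-off described above: you still see changes, but you skip the high-volume read events.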



🧠 Simple Mental Model

To keep all of this straight in my head, I created this quick reference table. Think of it like this:

Component | Main Responsibility
API Server | The front door/entry point for all cluster requests.
Audit | The security tracker (who did what).
Controller Manager | Maintains the desired state (makes sure reality matches the config).
Scheduler | Places pods onto the right nodes.
Autoscaler | Adds or removes nodes (virtual machines) in a node pool.
Guard | Authentication and security gateway (Microsoft Entra ID).


🛠️ Quick Troubleshooting Guide

If something breaks, which log should you check first? Here is the cheat sheet I will be using from now on:

Problem | Log to Check First
Pod is stuck in Pending state | kube-scheduler
Nodes aren't scaling up/down | cluster-autoscaler
Deployment isn't creating the right number of pods | kube-controller-manager
Investigating unauthorized API access | kube-audit (or kube-audit-admin)
kubectl commands are failing or timing out | kube-apiserver
You can't log in / AAD identity issues | guard
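The cheat sheet also works as a lookup table. Here is a small sketch that maps a symptom to its log category and builds a Log Analytics (KQL) query string for it. Several things here are assumptions: the symptom keys and the `build_query` helper are my own, the logs are assumed to land in the `AzureDiagnostics` table (Azure diagnostics destination mode, where the category is in `Category` and the message in `log_s`), and resource-specific mode uses different tables.

```python
# Symptom keys are hypothetical; categories come from the cheat sheet above.
FIRST_LOG_TO_CHECK = {
    "pod_pending": "kube-scheduler",
    "nodes_not_scaling": "cluster-autoscaler",
    "wrong_replica_count": "kube-controller-manager",
    "unauthorized_api_access": "kube-audit",
    "kubectl_failing": "kube-apiserver",
    "login_failure": "guard",
}


def build_query(symptom: str, limit: int = 50) -> str:
    """Build a KQL query for the most recent entries of the matching log category."""
    category = FIRST_LOG_TO_CHECK[symptom]
    return (
        "AzureDiagnostics\n"
        f'| where Category == "{category}"\n'
        "| project TimeGenerated, log_s\n"
        "| order by TimeGenerated desc\n"
        f"| take {limit}"
    )


print(build_query("pod_pending"))
```

The generated text is just a starting point to paste into the Logs blade of the workspace; from there you would add your own filters, such as a pod or node name.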