Automatic Scaling with the Kubernetes Horizontal Pod Autoscaler (HPA)


In the previous article, we deployed a simple PHP-Apache application and learned how to scale it manually.

But as I quickly realized, manual scaling is basically useless in the real world. Traffic spikes happen when you least expect them, and you don't want to be waking up at 3 AM to type kubectl scale into your terminal.

Today, I explored the absolute magic of Kubernetes automation: the Horizontal Pod Autoscaler (HPA).

Here is exactly how I set it up and watched it save my cluster from a simulated traffic flood!


Step 1: Generating the Fake Traffic

To test the autoscaler, we first need some traffic. I deployed a new YAML file called a "Load Generator." This is essentially a pod containing a script that endlessly pings our php-apache service in a tight loop.

I started by deploying just two of these load generator pods:

YAML

# load-generator-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: load-generator
spec:
  selector:
    matchLabels:
      run: load-generator
  replicas: 2
  template:
    metadata:
      labels:
        run: load-generator
    spec:
      containers:
      - name: load-generator
        image: busybox
        args: ["/bin/sh", "-c", "while true; do wget -q -O- http://php-apache; done"]


Bash
kubectl apply -f load-generator-deploy.yaml

I gave it a few seconds, then checked how my application pods were handling the pressure using the metrics server:

Bash
kubectl top pods

Result: My two PHP-Apache pods were sitting around 255m and 377m of CPU utilization.

If you remember from yesterday, our pods requested 200m of CPU but had a hard limit of 500m. The load was definitely increasing, but we hadn't hit the ceiling yet.
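For reference, the resources block from yesterday's deployment looked roughly like this (reconstructed from the numbers above, not a verbatim copy of that manifest):

YAML
# Reconstructed sketch of the php-apache container's resources block
resources:
  requests:
    cpu: 200m   # the HPA's utilization percentage is measured against this
  limits:
    cpu: 500m   # Kubernetes throttles the container once it hits this ceiling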

To make things interesting, I edited my load generator configuration to spin up 10 instances of the traffic-spiking pods and applied it again.

When I ran kubectl top pods this time, my application pods were redlining at 501m, right up against the absolute CPU limit. Kubernetes was starting to throttle them. It was time for the HPA to step in.


Step 2: Configuring the HPA

The Horizontal Pod Autoscaler is a native Kubernetes object, just like a Pod or a Deployment. You define it using a YAML manifest.

Here is the hpa.yaml file I used:

YAML
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  minReplicas: 3
  maxReplicas: 10 
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50

The Golden Rule of HPA Math

The most important thing I learned today is how the HPA calculates that 50 percent target.

The HPA bases its math on your pod's requests, NOT its limits. Since my pod requested 200m of CPU, a target of 50% means the HPA wants the average CPU usage across all pods to stay around 100m. Because my pods were currently screaming at 500m, the HPA was going to see a massive spike and take immediate action.
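To see why this math forces a scale-up, here is the HPA's core formula, desiredReplicas = ceil(currentReplicas × currentUsage / targetUsage), plugged into my readings as a quick shell sketch (the per-pod usage figure is the rough number I saw in kubectl top pods):

```shell
# HPA formula: desiredReplicas = ceil(currentReplicas * currentUsage / targetUsage)
# Both usage and target are measured against the pod's *request* (200m), not its limit.
current_replicas=2
current_millicores=500      # approximate per-pod usage observed with kubectl top pods
target_millicores=100       # 50% of the 200m request

# Integer ceiling division:
desired=$(( (current_replicas * current_millicores + target_millicores - 1) / target_millicores ))
echo "$desired"   # -> 10
```

That result of 10 is exactly where my maxReplicas setting caps the scale-up.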


Step 3: Watching the Magic Happen

I deployed the HPA to my cluster:

Bash
kubectl apply -f hpa.yaml

Next, I ran a command to monitor the HPA's thought process:

Bash
kubectl get hpa

Initial Output:

Plaintext
NAME         REFERENCE               TARGETS         MINPODS   MAXPODS   REPLICAS
php-apache   Deployment/php-apache   <unknown>/50%   3         10        2

(It takes about 15-20 seconds for the metrics server to gather the data, hence the <unknown> status at first).

A minute later, I ran the command again:

Plaintext
NAME         REFERENCE               TARGETS    MINPODS   MAXPODS   REPLICAS
php-apache   Deployment/php-apache   220%/50%   3         10        10

Boom. The HPA saw that average CPU utilization was 220% of the requested CPU, more than four times my 50% target. It immediately spun up new replicas, instantly hitting my defined maxReplicas limit of 10!

When I checked the actual pods (kubectl top pods), I could see all 10 replicas running, each pulling around 240m to 290m of CPU to share the massive load we generated.


Pushing the Limits

Because the average CPU was still well above my 50% target, I decided to take the training wheels off. I opened my hpa.yaml, changed maxReplicas: 10 to maxReplicas: 1000, and re-applied the file.

Within seconds, the HPA recognized the new ceiling and started provisioning heavily. The replica count jumped to 20, then 30, and eventually settled around 35 pods running simultaneously to properly balance the traffic!
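That settling point makes sense if you run the equilibrium math: scaling stabilizes roughly where total CPU demand divided by the 100m effective target equals the replica count. The 3500m aggregate figure below is my own back-of-the-envelope assumption, not a measured value:

```shell
# Rough equilibrium estimate: the HPA settles where average usage per pod
# drops to the effective target (100m = 50% of the 200m request).
total_load_millicores=3500   # assumed aggregate demand from the load generators
target_millicores=100
equilibrium=$(( (total_load_millicores + target_millicores - 1) / target_millicores ))
echo "$equilibrium"   # -> 35
```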


Beyond CPU: Using Custom and External Metrics

While scaling based on CPU and memory is standard, I discovered that the HPA is incredibly flexible. By using adapters (like the Prometheus Adapter or the Azure Monitor Adapter), you can tell the HPA to scale based on almost anything!

Inside the transcript I was studying, there were some amazing commented-out examples of what you can do with custom metrics:

  • Network Traffic: Scale based on packets-per-second hitting your pods.
  • HTTP Requests: Scale based on the exact number of requests-per-second hitting your Ingress controller.
  • External Cloud Services: My favorite feature! If you connect the AKS adapter for Azure Monitor, you can have your Kubernetes pods scale based on the number of messages waiting in an Azure Service Bus Queue. If the queue backs up to 30 unread messages, AKS spins up a new worker pod to process them. How cool is that?
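As a sketch of what that last scenario might look like, here is a hypothetical External-metric HPA manifest. The metric name and target value are assumptions; the real names depend entirely on which metrics adapter you install and how it is configured:

YAML
# Hypothetical sketch: scaling a worker Deployment on an external queue metric.
# "queue-worker" and the metric name "messagecount" are assumed placeholders.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: messagecount   # whatever name your adapter exposes for the queue depth
      target:
        type: AverageValue
        averageValue: "30"   # add a pod for roughly every 30 waiting messages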


So, the HPA successfully watched my pods and scaled them from 2 instances all the way up to 35 to handle the load.

But wait... all of those new pods require physical CPU and memory from my underlying Azure Virtual Machines. If I check kubectl top nodes, my three worker nodes are completely exhausted. I have plenty of pods, but nowhere left to put them!

How do we scale the actual nodes in the cluster? That is the job of the AKS Cluster Autoscaler, and that is exactly what I will be tackling in the next post!
