Monitoring Kubernetes with Prometheus and Grafana Outside the Cluster: VM-Based Setup
Introduction
Monitoring Kubernetes is an essential activity for ensuring that your applications, services, and infrastructure run optimally and efficiently. While most guides available online focus on deploying monitoring tools inside a Kubernetes cluster, this article demonstrates how to set up Prometheus (for metric collection) and Grafana (for data visualization) on a VM, outside the cluster itself.
Relying on a dedicated VM instead of deploying the monitoring stack inside the Kubernetes cluster offers several notable advantages. First, this architecture isolates the monitoring tools from the cluster, keeping them resilient and operational even when the cluster fails or malfunctions. It also eliminates the risk of exhausting the cluster's resources, since the monitoring stack runs completely separately from the cluster itself.
This tutorial will guide you through the step-by-step process of deploying Prometheus and Grafana on a VM for monitoring Kubernetes through service discovery.
Why Monitor Kubernetes?
Kubernetes manages dynamic, distributed workloads, but its complexity demands visibility. Without monitoring, you risk:
Resource bottlenecks: Unchecked CPU/memory usage can crash nodes.
Silent failures: Pod crashes, network errors, or hung deployments might go unnoticed.
Cost overruns: Overprovisioned resources or idle workloads waste money.
Scaling failures: Autoscalers rely on metrics to add/remove pods or nodes.
Monitoring tools like Prometheus and Grafana act as your "central nervous system," providing real-time insights into cluster health, application performance, and resource efficiency.
Why Run Prometheus and Grafana Outside Kubernetes?
Resilience:
If your Kubernetes cluster crashes (e.g., API server failures), your monitoring tools remain operational to diagnose issues.
Avoid a "circular dependency" where monitoring tools inside the cluster can’t report their own failures.
Resource Isolation:
Prometheus and Grafana won’t compete with Kubernetes workloads for CPU/memory.
Example: A memory-intensive application won’t starve Prometheus, preventing metric blackouts.
Simpler Maintenance:
Upgrade or restart monitoring tools without impacting Kubernetes.
Avoid managing Helm charts, operators, or Custom Resource Definitions (CRDs).
Security:
Limit Kubernetes API access to read-only permissions.
Reduce exposure to cluster-internal threats (e.g., compromised pods).
Multi-Cluster Support:
A single Prometheus/Grafana instance can monitor multiple clusters.
Setup Overview
Deploy a VM with access to your Kubernetes cluster.
Install Prometheus and Grafana directly on the VM.
Configure Prometheus to discover Kubernetes components.
Visualize metrics in Grafana with prebuilt dashboards.
Prerequisites
A Linux VM (Ubuntu 20.04+ recommended) with:
2+ CPU cores, 4GB+ RAM, 20GB+ disk space.
Network access to your Kubernetes API server.
kubectl configured on the VM to access your cluster.
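Before continuing, it is worth confirming that the VM can actually reach the cluster:

```shell
# Confirm kubectl is configured and the API server is reachable
kubectl cluster-info
kubectl get nodes -o wide
```

If either command fails, fix connectivity or your kubeconfig before proceeding.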
Step 1: Install Prometheus on the VM
1.1 Download and Extract Prometheus
This example uses an arm64 system; for other architectures, visit github.com/prometheus/prometheus/releases
wget https://github.com/prometheus/prometheus/releases/download/v3.2.1/prometheus-3.2.1.linux-arm64.tar.gz
tar -xvf prometheus-3.2.1.linux-arm64.tar.gz
cd prometheus-3.2.1.linux-arm64
1.2 Create System User and Directories
sudo useradd --no-create-home --shell /bin/false prometheus
sudo mkdir /etc/prometheus /var/lib/prometheus
sudo cp prometheus promtool /usr/local/bin/
1.3 Configure Prometheus to Discover Kubernetes
Create /etc/prometheus/prometheus.yml:
global:
  scrape_interval: 15s

scrape_configs:
  # Kubernetes API Server
  - job_name: 'kubernetes-apiservers'
    scheme: https
    bearer_token_file: /etc/prometheus/token
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets: ['192.168.71.143:6443']

  # cAdvisor: capturing container-level metrics
  - job_name: 'cadvisor'
    scheme: https
    bearer_token_file: /etc/prometheus/token
    tls_config:
      insecure_skip_verify: true
    kubernetes_sd_configs:
      - role: node
        api_server: 'https://192.168.71.143:6443'  # Kubernetes API IP:Port
        bearer_token_file: /etc/prometheus/token
        tls_config:
          insecure_skip_verify: true
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__meta_kubernetes_node_name]
        target_label: __address__
        replacement: '192.168.71.143:10250'  # Node IP and kubelet port
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /metrics/cadvisor

  # Node Exporter
  - job_name: 'node-exporter'
    kubernetes_sd_configs:
      - role: endpoints
        api_server: 'https://192.168.71.143:6443'
        bearer_token_file: /etc/prometheus/token
        tls_config:
          insecure_skip_verify: true
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        action: keep
        regex: 'node-exporter'
      - source_labels: [__address__]
        target_label: __address__
        replacement: '192.168.71.143:31672'  # Node IP and NodePort for the node-exporter service

  # Kube-State-Metrics
  - job_name: 'kube-state-metrics'
    kubernetes_sd_configs:
      - role: service
        api_server: 'https://192.168.71.143:6443'
        bearer_token_file: /etc/prometheus/token
        tls_config:
          insecure_skip_verify: true
    relabel_configs:
      - action: keep
        source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
        regex: kube-state-metrics  # Adjust this to match your service's labels
      - source_labels: [__address__]
        target_label: __address__
        replacement: '192.168.71.143:31673'  # Node IP and NodePort for the kube-state-metrics service
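Prometheus ships with promtool, which can validate this configuration before the service is started; catching a YAML indentation mistake here is much cheaper than debugging a failed restart later:

```shell
# Prints SUCCESS if the file parses and all scrape configs are well-formed
/usr/local/bin/promtool check config /etc/prometheus/prometheus.yml
```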
1.4 Set Permissions
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
1.5 Create a Systemd Service
Create /etc/systemd/system/prometheus.service:
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/
Restart=always

[Install]
WantedBy=multi-user.target
But how does Prometheus get access to the Kubernetes kingdom?
To access Kubernetes and scrape metrics, Prometheus needs proper authorization, typically provided via a service account. The service account is issued a token that lets Prometheus communicate securely with the Kubernetes API server. You also need a ClusterRole and a ClusterRoleBinding: the ClusterRole specifies the operations Prometheus may perform, such as reading pod and node metrics, while the ClusterRoleBinding ties the service account to that role, granting it cluster-wide access. Once set up, Prometheus authenticates using the service account's token, ensuring secure and authorized access to the Kubernetes cluster.
Create a Service Account on Kubernetes
Create a file prometheus_service_account.yaml:
# prometheus_service_account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
# Secret for the service account token
apiVersion: v1
kind: Secret
metadata:
  name: prometheus
  namespace: monitoring
  annotations:
    kubernetes.io/service-account.name: prometheus
type: kubernetes.io/service-account-token
---
# ClusterRole to assign to the service account
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources: ["nodes", "pods", "services", "endpoints"]
    verbs: ["get", "list", "watch"]
  - nonResourceURLs:
      - /metrics
      - /metrics/cadvisor
    verbs: ["get"]
  - apiGroups: ["authentication.k8s.io"]
    resources: ["tokenreviews"]
    verbs: ["create"]
---
# Bind the ClusterRole to the service account
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: monitoring
Let’s create the namespace (if it doesn’t already exist) and the account.
kubectl create namespace monitoring
kubectl apply -f prometheus_service_account.yaml
Get the Bearer Token for the Service Account
Prometheus will authenticate using a token associated with the service account. The token is stored in a base64-encoded format and therefore has to be decoded before use. To retrieve the token for the prometheus service account, run the following command:
kubectl get secret prometheus -n monitoring -o=jsonpath='{.data.token}' | base64 -d; echo
Save the output to /etc/prometheus/token (referenced by the bearer_token_file attribute in prometheus.yml) on the Prometheus server.
Ensure the /etc/prometheus/token file is owned by the prometheus user and readable only by it.
sudo chown prometheus:prometheus /etc/prometheus/token
sudo chmod 600 /etc/prometheus/token
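Before wiring the token into Prometheus, you can sanity-check it directly against the API server; the address below is the one used in prometheus.yml, so substitute your own:

```shell
# A valid token plus the ClusterRole above should return a JSON NodeList
curl -sk -H "Authorization: Bearer $(cat /etc/prometheus/token)" \
  https://192.168.71.143:6443/api/v1/nodes
```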
Install kube-state-metrics on Kubernetes to scrape cluster metrics
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-state-metrics prometheus-community/kube-state-metrics --namespace monitoring
Remember to expose the service via a NodePort or Ingress so the Prometheus VM can reach it.
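The Helm chart creates a ClusterIP service by default. One way to expose it on the NodePort referenced in prometheus.yml is a patch like the following — a sketch that assumes the chart's default service port of 8080 and the NodePort 31673 used earlier:

```shell
kubectl patch svc kube-state-metrics -n monitoring \
  -p '{"spec": {"type": "NodePort", "ports": [{"port": 8080, "nodePort": 31673}]}}'
```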
1.6 Start Prometheus
sudo systemctl daemon-reload
sudo systemctl start prometheus
sudo systemctl enable prometheus
Verify it’s running:
sudo systemctl status prometheus
Access Prometheus at http://<VM_IP>:9090.
Verify the Target Health. Click on Status > Target health
Provided there is connectivity between the Prometheus server and Kubernetes cluster, all the targets should be up.
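The same check can be scripted against the Prometheus HTTP API; every healthy target reports an up value of 1:

```shell
curl -s 'http://localhost:9090/api/v1/query?query=up'
```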
Click on the Endpoints to see the collected metrics. Here I am showing output for kube-state-metrics.
Step 2: Install Grafana on the VM
Wonderful! Our Prometheus server is running and collecting metrics. Next, we need to link up our Grafana server for data visualization.
Grafana is a powerful, flexible open-source visualization tool that works seamlessly with Prometheus. With Grafana, we will be able to create real-time dashboards that let us monitor the health of our Kubernetes clusters along with a variety of performance indicators. Let’s install and configure Grafana, connect it to our Prometheus instance, and set up informative dashboards to visualize the metrics.
2.1 Install the prerequisite packages:
sudo apt-get install -y apt-transport-https software-properties-common wget
2.2 Import the GPG key:
sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
2.3 To add a repository for stable releases, run the following command:
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
[Alternatively] To add a repository for beta releases, run the following command:
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com beta main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
2.4 Run the following command to update the list of available packages:
# Updates the list of available packages
sudo apt-get update
2.5 To install Grafana OSS, run the following command:
# Installs the latest OSS release:
sudo apt-get install grafana -y
[Alternatively] To install Grafana Enterprise, run the following command:
# Installs the latest Enterprise release:
sudo apt-get install grafana-enterprise
2.6 Start Grafana
sudo systemctl start grafana-server
sudo systemctl enable grafana-server
Access Grafana at http://<VM_IP>:3000 (default login: admin/admin).
On first login, Grafana will prompt you to update the password; set a secure one, or skip this step for now.
On successful login, you will be welcomed with the below page.
Step 3: Connect Grafana to Prometheus
In Grafana, go to Connections > Data Sources > Add data source.
Choose Prometheus, set the URL to http://<Prometheus Server IP>:9090, and click Save & Test.
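If you prefer configuration as code, the same data source can be provisioned from a file instead of the UI — a minimal sketch using Grafana's standard provisioning directory, assuming Prometheus runs on the same VM:

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```

Grafana loads provisioning files at startup, so restart grafana-server after creating it.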
Step 4: Import Kubernetes Dashboards
Let’s create some dashboards to visualize the metrics collected by our Prometheus instance.
In Grafana, go to Dashboard > Create Dashboard.
We will be using pre-built dashboards for this demo. Click on Import Dashboard.
When you click the Import dashboard button, a pop-up will appear prompting you to save the dashboard; click Discard.
There are numerous pre-built dashboards available that you can import and use. Visit https://grafana.com/grafana/dashboards/ for a dashboard lookup.
We are using dashboard ID 15661 for this demo. Enter the dashboard ID and click Load.
Select the Prometheus data source and click on Import.
Voila! Your dashboard is ready.
Troubleshooting: Common Issues and Solutions
1. Prometheus Fails to Start
Symptoms:
- systemctl status prometheus shows a failed state.
- No web interface at http://<VM_IP>:9090.
Diagnosis:
Check logs:
sudo journalctl -u prometheus --no-pager -f
Look for errors like error loading config, permission denied, or port conflicts.
Solutions:
Invalid YAML Syntax: Validate your prometheus.yml configuration:
/usr/local/bin/promtool check config /etc/prometheus/prometheus.yml
Permission Issues: Ensure directories and files are owned by the prometheus user:
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
Port Conflicts: Check if port 9090 is already in use:
sudo lsof -i :9090
Kill conflicting processes or modify Prometheus’s --web.listen-address flag.
2. Prometheus Cannot Scrape Kubernetes Metrics
Symptoms:
- No targets visible at http://<VM_IP>:9090/targets.
- Targets show connection refused or 403 Forbidden errors.
Diagnosis:
Verify network connectivity to the Kubernetes API server:
curl -k -H "Authorization: Bearer $(cat /etc/prometheus/token)" https://<KUBERNETES_API_SERVER>/api/v1/nodes
Replace <KUBERNETES_API_SERVER> with your cluster’s API endpoint.
Solutions:
Invalid API Server Address: Ensure the api_server values in prometheus.yml match your cluster’s API endpoint (run kubectl cluster-info to confirm).
Expired or Invalid Token: Regenerate the service account token:
kubectl get secret prometheus -n monitoring -o=jsonpath='{.data.token}' | base64 --decode | sudo tee /etc/prometheus/token
Firewall Rules: Ensure the VM’s firewall allows outbound traffic to the Kubernetes API port (usually 443 or 6443). For UFW:
sudo ufw allow out 6443/tcp
3. Grafana Shows "No Data" for Dashboards
Symptoms:
- Dashboards load but display "No data" panels.
Diagnosis:
Check that Grafana’s data source is correctly configured:
- In Grafana, go to Connections > Data Sources > Prometheus.
- Ensure the URL is http://localhost:9090 (if Prometheus runs on the same VM).
Verify Prometheus has scraped metrics:
- Visit http://<VM_IP>:9090/graph and query up to see active targets.
Solutions:
Misconfigured Scrape Jobs: Confirm your prometheus.yml includes the correct kubernetes_sd_configs for nodes and services.
Missing Metrics Endpoints: Ensure Kubernetes components (e.g., the kubelet) expose metrics:
curl -k https://<NODE_IP>:10250/metrics
Replace <NODE_IP> with a worker node’s IP. If the request is blocked, check the kubelet’s --read-only-port flag (10255 serves metrics over plain HTTP).
4. High Resource Usage on the VM
Symptoms:
Prometheus/Grafana crashes or becomes unresponsive.
VM CPU/memory usage spikes.
Diagnosis:
Check resource usage:
top
Identify whether prometheus or grafana-server is consuming excessive resources.
Solutions:
Limit Prometheus Resource Usage: Edit /etc/systemd/system/prometheus.service and add retention and query limits:
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/ \
  --web.external-url=http://<VM_IP>:9090 \
  --storage.tsdb.retention.time=30d \
  --query.max-concurrency=20 \
  --query.max-samples=50000000
Restart Prometheus:
sudo systemctl restart prometheus
Optimize Grafana: Reduce dashboard refresh intervals or limit concurrent users in /etc/grafana/grafana.ini.
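Note that these flags bound retention and query load but do not hard-cap memory. If you want the kernel to enforce a ceiling, systemd resource controls can do it — a sketch in which the 3G limit is an assumption to size against your VM:

```ini
# /etc/systemd/system/prometheus.service.d/limits.conf
[Service]
MemoryMax=3G
# Optional: cap CPU usage at two cores' worth
CPUQuota=200%
```

Apply with sudo systemctl daemon-reload && sudo systemctl restart prometheus.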
5. Certificate Validation Failures
Symptoms:
- Prometheus logs show x509: certificate signed by unknown authority.
Solutions:
Disable TLS verification (for testing only): Add insecure_skip_verify: true to the tls_config block in prometheus.yml:
tls_config:
  insecure_skip_verify: true
For Production: Copy the Kubernetes cluster’s CA certificate to the VM and configure Prometheus to trust it:
tls_config:
  ca_file: /etc/prometheus/cluster-ca.crt
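If your kubeconfig embeds the cluster CA, you can extract it from there instead of copying it off the control plane — a hedged sketch, run from a machine with cluster access:

```shell
# Decode the embedded CA of the first cluster in the kubeconfig
kubectl config view --raw -o jsonpath='{.clusters[0].cluster.certificate-authority-data}' \
  | base64 -d | sudo tee /etc/prometheus/cluster-ca.crt > /dev/null
```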
6. Grafana Login Issues
Symptoms:
- Unable to log in with the default credentials (admin/admin).
Solutions:
Reset the admin password:
sudo grafana-cli admin reset-admin-password newpassword
Check Grafana logs for authentication errors:
sudo journalctl -u grafana-server -f
7. Time Synchronization Issues
Symptoms:
- Grafana dashboards show metrics with incorrect timestamps.
Solutions:
Ensure the VM’s clock is synchronized with NTP:
sudo timedatectl set-ntp true
sudo systemctl restart systemd-timesyncd
Pro Tips for Maintenance
Backup Grafana Dashboards:
Use the Grafana API to export dashboards:
curl -s http://admin:admin@localhost:3000/api/dashboards/uid/<DASHBOARD_UID> | jq . > dashboard.json
(Find a dashboard’s UID in its URL or via the /api/search endpoint; recent Grafana versions no longer serve the old slug-based /api/dashboards/db/ path.)
Rotate Prometheus Logs:
Configure journald to limit log size in /etc/systemd/journald.conf:
SystemMaxUse=1G
Monitor the VM Itself:
Install the Node Exporter to track the VM's CPU, memory, and disk usage. This example uses the arm64 build; for other architectures, visit https://github.com/prometheus/node_exporter/releases/
wget https://github.com/prometheus/node_exporter/releases/download/v1.9.0/node_exporter-1.9.0.linux-arm64.tar.gz
tar -xvf node_exporter-*.tar.gz
cd node_exporter-*
sudo ./node_exporter
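Rather than leaving node_exporter running in a foreground shell, you can manage it with systemd like Prometheus above — a sketch that assumes the binary has been copied to /usr/local/bin:

```ini
# /etc/systemd/system/node_exporter.service
[Unit]
Description=Prometheus Node Exporter
After=network-online.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/node_exporter
Restart=always

[Install]
WantedBy=multi-user.target
```

Enable it with sudo systemctl enable --now node_exporter.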
Add a scrape job to prometheus.yml:
- job_name: 'vm'
  static_configs:
    - targets: ['localhost:9100']
Conclusion
Running Prometheus and Grafana on a dedicated VM provides a resilient, scalable monitoring solution that is independent of the Kubernetes clusters it watches. This guide has walked you through setting up a centralized observability stack capable of monitoring multiple clusters while delivering detailed insights into their performance, health, and resource utilization. You can now diagnose outages and performance bottlenecks and optimize resource allocation to keep your infrastructure running smoothly, with Grafana visualizing the metrics Prometheus collects in real time.