Monitoring Kubernetes with Prometheus and Grafana Outside the Cluster: VM-Based Setup
Introduction
Monitoring Kubernetes is an essential activity for ensuring that your applications, services, and infrastructure run optimally and efficiently. While most guides available online focus on deploying monitoring tools inside a Kubernetes cluster, this article demonstrates how to set up Prometheus (for metric collection) and Grafana (for data visualization) on a VM, outside the cluster itself.
Relying on a dedicated VM instead of deploying the monitoring stack inside the Kubernetes cluster offers several notable advantages. First, this architecture isolates the monitoring tools from the cluster, keeping them resilient and operational even when the cluster fails or malfunctions. It also eliminates the risk of exhausting the cluster's resources, since the monitoring stack runs completely separately from the cluster itself.
This tutorial will guide you through the step-by-step process of deploying Prometheus and Grafana on a VM for monitoring Kubernetes through service discovery.
Why Monitor Kubernetes?
Kubernetes manages dynamic, distributed workloads, but its complexity demands visibility. Without monitoring, you risk:
Resource bottlenecks: Unchecked CPU/memory usage can crash nodes.
Silent failures: Pod crashes, network errors, or hung deployments might go unnoticed.
Cost overruns: Overprovisioned resources or idle workloads waste money.
Scaling failures: Autoscalers rely on metrics to add/remove pods or nodes.
Monitoring tools like Prometheus and Grafana act as your "central nervous system," providing real-time insights into cluster health, application performance, and resource efficiency.
Why Run Prometheus and Grafana Outside Kubernetes?
Resilience:
If your Kubernetes cluster crashes (e.g., API server failures), your monitoring tools remain operational to diagnose issues.
Avoid a "circular dependency" where monitoring tools inside the cluster can’t report their own failures.
Resource Isolation:
Prometheus and Grafana won’t compete with Kubernetes workloads for CPU/memory.
Example: A memory-intensive application won’t starve Prometheus, preventing metric blackouts.
Simpler Maintenance:
Upgrade or restart monitoring tools without impacting Kubernetes.
Avoid managing Helm charts, operators, or Custom Resource Definitions (CRDs).
Security:
Limit Kubernetes API access to read-only permissions.
Reduce exposure to cluster-internal threats (e.g., compromised pods).
Multi-Cluster Support:
A single Prometheus/Grafana instance can monitor multiple clusters.
Setup Overview
Deploy a VM with access to your Kubernetes cluster.
Install Prometheus and Grafana directly on the VM.
Configure Prometheus to discover Kubernetes components.
Visualize metrics in Grafana with prebuilt dashboards.
Prerequisites
A Linux VM (Ubuntu 20.04+ recommended) with:
2+ CPU cores, 4GB+ RAM, 20GB+ disk space.
Network access to your Kubernetes API server.
kubectl configured on the VM to access your cluster.
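Before continuing, it is worth confirming that the VM can actually reach the cluster:

```shell
# Confirm kubectl is configured and the API server is reachable
kubectl cluster-info
kubectl get nodes -o wide
```

If either command fails, fix connectivity or your kubeconfig before proceeding.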
Step 1: Install Prometheus on the VM
1.1 Download and Extract Prometheus
This example uses an arm64 system; for other architectures, visit github.com/prometheus/prometheus/releases
wget https://github.com/prometheus/prometheus/releases/download/v3.2.1/prometheus-3.2.1.linux-arm64.tar.gz
tar -xvf prometheus-3.2.1.linux-arm64.tar.gz
cd prometheus-3.2.1.linux-arm64
1.2 Create System User and Directories
sudo useradd --no-create-home --shell /bin/false prometheus
sudo mkdir /etc/prometheus /var/lib/prometheus
sudo cp prometheus promtool /usr/local/bin/
1.3 Configure Prometheus to Discover Kubernetes
Create /etc/prometheus/prometheus.yml:
global:
  scrape_interval: 15s

scrape_configs:
  # Kubernetes API Server
  - job_name: 'kubernetes-apiservers'
    scheme: https
    bearer_token_file: /etc/prometheus/token
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets: ['192.168.71.143:6443']

  # cAdvisor: capturing container-level metrics
  - job_name: 'cadvisor'
    scheme: https
    bearer_token_file: /etc/prometheus/token
    tls_config:
      insecure_skip_verify: true
    kubernetes_sd_configs:
      - role: node
        api_server: 'https://192.168.71.143:6443'  # Kubernetes API IP:Port
        bearer_token_file: /etc/prometheus/token
        tls_config:
          insecure_skip_verify: true
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__meta_kubernetes_node_name]
        target_label: __address__
        replacement: '192.168.71.143:10250'  # Node IP and kubelet port
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /metrics/cadvisor

  # Node Exporter
  - job_name: 'node-exporter'
    kubernetes_sd_configs:
      - role: endpoints
        api_server: 'https://192.168.71.143:6443'
        bearer_token_file: /etc/prometheus/token
        tls_config:
          insecure_skip_verify: true
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        action: keep
        regex: 'node-exporter'
      - source_labels: [__address__]
        target_label: __address__
        replacement: '192.168.71.143:31672'  # Node IP and NodePort for the node-exporter service

  # Kube-State-Metrics
  - job_name: 'kube-state-metrics'
    kubernetes_sd_configs:
      - role: service
        api_server: 'https://192.168.71.143:6443'
        bearer_token_file: /etc/prometheus/token
        tls_config:
          insecure_skip_verify: true
    relabel_configs:
      - action: keep
        source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
        regex: kube-state-metrics  # Adjust this to match your service's labels
      - source_labels: [__address__]
        target_label: __address__
        replacement: '192.168.71.143:31673'  # Node IP and NodePort for the kube-state-metrics service
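Prometheus ships with promtool, which can validate this configuration before the service is started; catching a YAML indentation mistake here is much cheaper than debugging a failed restart later:

```shell
# Prints SUCCESS if the file parses and all scrape configs are well-formed
/usr/local/bin/promtool check config /etc/prometheus/prometheus.yml
```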
1.4 Set Permissions
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
1.5 Create a Systemd Service
Create /etc/systemd/system/prometheus.service:
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/
Restart=always

[Install]
WantedBy=multi-user.target
But how does Prometheus get access to the Kubernetes kingdom?
To access Kubernetes and scrape metrics, Prometheus needs proper authorization, typically provided via a service account. The service account is issued a token that lets Prometheus communicate securely with the Kubernetes API server. You also need a ClusterRole and a ClusterRoleBinding: the ClusterRole specifies the operations Prometheus may perform, such as reading pod and node metrics, while the ClusterRoleBinding ties the service account to that role, granting it cluster-wide access. Once set up, Prometheus authenticates using the service account's token, ensuring secure and authorized access to the Kubernetes cluster.
Create a Service Account on Kubernetes
Create a file prometheus_service_account.yaml:
# prometheus_service_account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
# Secret for the service account token
apiVersion: v1
kind: Secret
metadata:
  name: prometheus
  namespace: monitoring
  annotations:
    kubernetes.io/service-account.name: prometheus
type: kubernetes.io/service-account-token
---
# ClusterRole to assign to the service account
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources: ["nodes", "pods", "services", "endpoints"]
    verbs: ["get", "list", "watch"]
  - nonResourceURLs:
      - /metrics
      - /metrics/cadvisor
    verbs: ["get"]
  - apiGroups: ["authentication.k8s.io"]
    resources: ["tokenreviews"]
    verbs: ["create"]
---
# Bind the ClusterRole to the service account
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: monitoring
Let’s create the namespace (if it doesn’t already exist) and the account.
kubectl create namespace monitoring
kubectl apply -f prometheus_service_account.yaml
Get the Bearer Token for the Service Account
Prometheus will authenticate using a token associated with the service account. The token is stored in a base64-encoded format and therefore has to be decoded before use. To retrieve the token for the prometheus service account, run the following command:
kubectl get secret prometheus -n monitoring -o=jsonpath='{.data.token}' | base64 -d; echo
Save the output to /etc/prometheus/token (referenced by the bearer_token_file attribute in prometheus.yml) on the Prometheus server.
Ensure the /etc/prometheus/token file is owned by the prometheus user and readable only by it.
sudo chown prometheus:prometheus /etc/prometheus/token
sudo chmod 600 /etc/prometheus/token
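Before wiring the token into Prometheus, you can sanity-check it directly against the API server; the address below is the one used in prometheus.yml, so substitute your own:

```shell
# A valid token plus the ClusterRole above should return a JSON NodeList
curl -sk -H "Authorization: Bearer $(cat /etc/prometheus/token)" \
  https://192.168.71.143:6443/api/v1/nodes
```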
Install kube-state-metrics on Kubernetes to scrape cluster metrics
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-state-metrics prometheus-community/kube-state-metrics --namespace monitoring
Remember to expose the service via a NodePort or Ingress so the Prometheus VM can reach it.
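The Helm chart creates a ClusterIP service by default. One way to expose it on the NodePort referenced in prometheus.yml is a patch like the following — a sketch that assumes the chart's default service port of 8080 and the NodePort 31673 used earlier:

```shell
kubectl patch svc kube-state-metrics -n monitoring \
  -p '{"spec": {"type": "NodePort", "ports": [{"port": 8080, "nodePort": 31673}]}}'
```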
1.6 Start Prometheus
sudo systemctl daemon-reload
sudo systemctl start prometheus
sudo systemctl enable prometheus
Verify it’s running:
sudo systemctl status prometheus
Access Prometheus at http://<VM_IP>:9090.
Verify the Target Health. Click on Status > Target health
Provided there is connectivity between the Prometheus server and Kubernetes cluster, all the targets should be up.
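The same check can be scripted against the Prometheus HTTP API; every healthy target reports an up value of 1:

```shell
curl -s 'http://localhost:9090/api/v1/query?query=up'
```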
Click on the Endpoints to see the collected metrics. Here I am showing output for kube-state-metrics.
Step 2: Install Grafana on the VM
Wonderful! Our Prometheus server is running and collecting metrics. Next, we need to link up our Grafana server for data visualization.
Grafana is a powerful, flexible open-source visualization tool that works seamlessly with Prometheus. With Grafana, we will be able to create real-time dashboards that let us monitor the health of our Kubernetes clusters along with a variety of performance indicators. Let’s install and configure Grafana, connect it to our Prometheus instance, and set up informative dashboards to visualize the metrics.
2.1 Install the prerequisite packages:
sudo apt-get install -y apt-transport-https software-properties-common wget
2.2 Import the GPG key:
sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
2.3 To add a repository for stable releases, run the following command:
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
[Alternatively] To add a repository for beta releases, run the following command:
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com beta main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
2.4 Run the following command to update the list of available packages:
# Updates the list of available packages
sudo apt-get update
2.5 To install Grafana OSS, run the following command:
# Installs the latest OSS release:
sudo apt-get install grafana -y
[Alternatively] To install Grafana Enterprise, run the following command:
# Installs the latest Enterprise release:
sudo apt-get install grafana-enterprise
2.6 Start Grafana
sudo systemctl start grafana-server
sudo systemctl enable grafana-server
Access Grafana at http://<VM_IP>:3000 (default login: admin/admin).
On first login, Grafana will prompt you to update the password; set a secure one, or skip this step for now.
On successful login, you will be welcomed with the below page.
Step 3: Connect Grafana to Prometheus
In Grafana, go to Connections > Data Sources > Add data source.
Choose Prometheus, set the URL to http://<Prometheus Server IP>:9090, and click Save & Test.
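If you prefer configuration as code, the same data source can be provisioned from a file instead of the UI — a minimal sketch using Grafana's standard provisioning directory, assuming Prometheus runs on the same VM:

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```

Grafana loads provisioning files at startup, so restart grafana-server after creating it.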
Step 4: Import Kubernetes Dashboards
Let’s create some dashboards to visualize the metrics collected by our Prometheus instance.
In Grafana, go to Dashboard > Create Dashboard.
We will be using pre-built dashboards for this demo. Click on Import Dashboard.
When you click the Import dashboard button, a pop-up will appear prompting you to save the dashboard; click Discard.
There are numerous pre-built dashboards available that you can import and use. Visit https://grafana.com/grafana/dashboards/ for a dashboard lookup.
We are using dashboard ID 15661 for this demo. Enter the dashboard ID and click Load.
Select the Prometheus data source and click on Import.
Voila! Your dashboard is ready.
Troubleshooting: Common Issues and Solutions
1. Prometheus Fails to Start
Symptoms:
- systemctl status prometheus shows a failed state.
- No web interface at http://<VM_IP>:9090.
Diagnosis:
Check logs:
sudo journalctl -u prometheus --no-pager -f
Look for errors like error loading config, permission denied, or port conflicts.
Solutions:
Invalid YAML Syntax: Validate your prometheus.yml configuration:
/usr/local/bin/promtool check config /etc/prometheus/prometheus.yml
Permission Issues: Ensure directories and files are owned by the prometheus user:
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
Port Conflicts: Check if port 9090 is already in use:
sudo lsof -i :9090
Kill conflicting processes or modify Prometheus’s --web.listen-address flag.
2. Prometheus Cannot Scrape Kubernetes Metrics
Symptoms:
- No targets visible at http://<VM_IP>:9090/targets.
- Targets show connection refused or 403 Forbidden errors.
Diagnosis:
Verify network connectivity to the Kubernetes API server:
curl -k -H "Authorization: Bearer $(cat /etc/prometheus/token)" https://<KUBERNETES_API_SERVER>/api/v1/nodes
Replace <KUBERNETES_API_SERVER> with your cluster’s API endpoint.
Solutions:
Invalid API Server Address: Ensure the api_server values in prometheus.yml match your cluster’s API endpoint (run kubectl cluster-info to confirm).
Expired or Invalid Token: Regenerate the service account token:
kubectl get secret prometheus -n monitoring -o=jsonpath='{.data.token}' | base64 --decode | sudo tee /etc/prometheus/token
Firewall Rules: Ensure the VM’s firewall allows outbound traffic to the Kubernetes API port (usually 443 or 6443). For UFW:
sudo ufw allow out 6443/tcp
3. Grafana Shows "No Data" for Dashboards
Symptoms:
- Dashboards load but display "No data" panels.
Diagnosis:
Check that Grafana’s data source is correctly configured:
- In Grafana, go to Connections > Data Sources > Prometheus.
- Ensure the URL is http://localhost:9090 (if Prometheus runs on the same VM).
Verify Prometheus has scraped metrics:
- Visit http://<VM_IP>:9090/graph and query up to see active targets.
Solutions:
Misconfigured Scrape Jobs: Confirm your prometheus.yml includes the correct kubernetes_sd_configs for nodes and services.
Missing Metrics Endpoints: Ensure Kubernetes components (e.g., the kubelet) expose metrics:
curl -k https://<NODE_IP>:10250/metrics
Replace <NODE_IP> with a worker node’s IP. If the request is blocked, check the kubelet’s --read-only-port flag (10255 serves metrics over plain HTTP).
4. High Resource Usage on the VM
Symptoms:
Prometheus/Grafana crashes or becomes unresponsive.
VM CPU/memory usage spikes.
Diagnosis:
Check resource usage:
top
Identify whether prometheus or grafana-server is consuming excessive resources.
Solutions:
Limit Prometheus Resource Usage: Edit /etc/systemd/system/prometheus.service and add retention and query limits:
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/ \
  --web.external-url=http://<VM_IP>:9090 \
  --storage.tsdb.retention.time=30d \
  --query.max-concurrency=20 \
  --query.max-samples=50000000
Restart Prometheus:
sudo systemctl restart prometheus
Optimize Grafana: Reduce dashboard refresh intervals or limit concurrent users in /etc/grafana/grafana.ini.
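Note that these flags bound retention and query load but do not hard-cap memory. If you want the kernel to enforce a ceiling, systemd resource controls can do it — a sketch in which the 3G limit is an assumption to size against your VM:

```ini
# /etc/systemd/system/prometheus.service.d/limits.conf
[Service]
MemoryMax=3G
# Optional: cap CPU usage at two cores' worth
CPUQuota=200%
```

Apply with sudo systemctl daemon-reload && sudo systemctl restart prometheus.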
5. Certificate Validation Failures
Symptoms:
- Prometheus logs show x509: certificate signed by unknown authority.
Solutions:
Disable TLS verification (for testing only): Add insecure_skip_verify: true to the tls_config block in prometheus.yml:
tls_config:
  insecure_skip_verify: true
For Production: Copy the Kubernetes cluster’s CA certificate to the VM and configure Prometheus to trust it:
tls_config:
  ca_file: /etc/prometheus/cluster-ca.crt
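If your kubeconfig embeds the cluster CA, you can extract it from there instead of copying it off the control plane — a hedged sketch, run from a machine with cluster access:

```shell
# Decode the embedded CA of the first cluster in the kubeconfig
kubectl config view --raw -o jsonpath='{.clusters[0].cluster.certificate-authority-data}' \
  | base64 -d | sudo tee /etc/prometheus/cluster-ca.crt > /dev/null
```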
6. Grafana Login Issues
Symptoms:
- Unable to log in with the default credentials (admin/admin).
Solutions:
Reset the admin password:
sudo grafana-cli admin reset-admin-password newpassword
Check Grafana logs for authentication errors:
sudo journalctl -u grafana-server -f
7. Time Synchronization Issues
Symptoms:
- Grafana dashboards show metrics with incorrect timestamps.
Solutions:
Ensure the VM’s clock is synchronized with NTP:
sudo timedatectl set-ntp true
sudo systemctl restart systemd-timesyncd
Pro Tips for Maintenance
Backup Grafana Dashboards:
Use the Grafana API to export dashboards:
curl -s http://admin:admin@localhost:3000/api/dashboards/uid/<DASHBOARD_UID> | jq . > dashboard.json
(Find a dashboard’s UID in its URL or via the /api/search endpoint; recent Grafana versions no longer serve the old slug-based /api/dashboards/db/ path.)
Rotate Prometheus Logs:
Configure journald to limit log size in /etc/systemd/journald.conf:
SystemMaxUse=1G
Monitor the VM Itself:
Install the Node Exporter to track the VM's CPU, memory, and disk usage. This example uses the arm64 build; for other architectures, visit https://github.com/prometheus/node_exporter/releases/
wget https://github.com/prometheus/node_exporter/releases/download/v1.9.0/node_exporter-1.9.0.linux-arm64.tar.gz
tar -xvf node_exporter-*.tar.gz
cd node_exporter-*
sudo ./node_exporter
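Rather than leaving node_exporter running in a foreground shell, you can manage it with systemd like Prometheus above — a sketch that assumes the binary has been copied to /usr/local/bin:

```ini
# /etc/systemd/system/node_exporter.service
[Unit]
Description=Prometheus Node Exporter
After=network-online.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/node_exporter
Restart=always

[Install]
WantedBy=multi-user.target
```

Enable it with sudo systemctl enable --now node_exporter.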
Add a scrape job to prometheus.yml:
- job_name: 'vm'
  static_configs:
    - targets: ['localhost:9100']
Conclusion
Running Prometheus and Grafana on a dedicated VM provides a resilient, scalable monitoring solution that is independent of the Kubernetes clusters it watches. This guide has walked you through setting up a centralized observability stack capable of monitoring multiple clusters while delivering detailed insights into their performance, health, and resource utilization. You can now diagnose outages and performance bottlenecks and optimize resource allocation to keep your infrastructure running smoothly, with Grafana visualizing the metrics Prometheus collects in real time.