Observability¶
kube-prometheus-stack¶
What it is¶
A single Helm chart that installs the complete monitoring stack:
| Component | Role |
|---|---|
| Prometheus | Time-series metrics collection and storage |
| Grafana | Visualization dashboards |
| AlertManager | Alert routing and notifications |
| node-exporter | Per-node system metrics (DaemonSet) |
| kube-state-metrics | Kubernetes object state metrics |
Installation via Flux¶
```yaml
# HelmRepository
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: prometheus-community
  namespace: flux-system
spec:
  interval: 1h
  url: https://prometheus-community.github.io/helm-charts
```

```yaml
# HelmRelease (key values)
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: kube-prometheus-stack
  namespace: monitoring
spec:
  interval: 30m
  chart:
    spec:
      chart: kube-prometheus-stack
      version: "82.10.1"
      sourceRef:
        kind: HelmRepository
        name: prometheus-community
        namespace: flux-system
  install:
    createNamespace: true
  values:
    # Disable components k3s doesn't expose in standard form
    kubeControllerManager:
      enabled: false
    kubeScheduler:
      enabled: false
    kubeEtcd:
      enabled: false
    kubeProxy:
      enabled: false  # Cilium replaced kube-proxy
    kubelet:
      enabled: true
      serviceMonitor:
        https: true
        insecureSkipVerify: true
        cAdvisor: true
    grafana:
      enabled: true
      admin:
        existingSecret: grafana-admin-secret
        userKey: admin-user
        passwordKey: admin-password
      ingress:
        enabled: true
        ingressClassName: traefik
        annotations:
          traefik.ingress.kubernetes.io/ssl-redirect: "true"
        hosts:
          - grafana.cluster.kcn333.com
        tls:
          - secretName: grafana-tls
            hosts:
              - grafana.cluster.kcn333.com
    prometheus:
      prometheusSpec:
        hostNetwork: true  # critical for k3s — see below
        hostPID: true
        retention: 7d
        storageSpec:
          volumeClaimTemplate:
            spec:
              storageClassName: longhorn
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 10Gi
```
Why disabled components?¶
k3s embeds some control-plane components and doesn't expose them on standard ports. Trying to scrape them causes persistent errors in Prometheus:
| Component | Why disabled |
|---|---|
| kubeControllerManager | k3s embeds it, non-standard endpoint |
| kubeScheduler | Same |
| kubeEtcd | Embedded etcd, different port |
| kubeProxy | Replaced by Cilium eBPF |
Problem: no data in Grafana — Cilium blocking pod→nodeIP traffic¶
Symptom: "No data" on Kubernetes Compute Resources dashboards for CPU/Memory.
Root cause: Cilium in VXLAN mode doesn't route traffic from pods to node IPs (192.168.55.x). Prometheus running as a regular pod couldn't reach kubelet on port 10250 or node-exporter on port 9100 on other nodes.
Diagnosis:
```bash
# From Prometheus pod — this failed
kubectl -n monitoring exec -it prometheus-kube-prometheus-stack-prometheus-0 -- \
  wget -qO- --no-check-certificate https://192.168.55.10:10250/healthz

# Internet worked fine — confirming the issue is node IP routing, not network in general
kubectl -n monitoring exec -it prometheus-kube-prometheus-stack-prometheus-0 -- \
  wget -qO- https://8.8.8.8 | head -3
```
The wrong approach (tried, don't do this):
```yaml
# Don't — this opens too much
hostServices:
  enabled: true
autoDirectNodeRoutes: true
bpf:
  hostLegacyRouting: true
```
The right approach — hostNetwork: true for Prometheus:
Prometheus runs on the host network and has direct access to kubelet (:10250) and node-exporter (:9100) on every node. No Cilium config changes are needed. This is the documented approach for k3s + kube-prometheus-stack.
Problem: UFW blocking port 9100 between nodes¶
Symptom: Only 1 node visible in the "Nodes" Grafana dashboard (the one where Prometheus was scheduled).
With hostNetwork: true, Prometheus egresses from the node's own IP (192.168.55.x) when scraping other nodes, and UFW on those nodes had no rule allowing port 9100.
Fix: allow port 9100 between cluster nodes in UFW, and make the rule permanent in the Ansible playbook.
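A sketch of the one-off rule, assuming the node subnet shown above (192.168.55.0/24); in the playbook the same rule would typically be expressed with the `community.general.ufw` Ansible module:

```shell
# Allow node-exporter scrapes from other cluster nodes (run on every node)
sudo ufw allow from 192.168.55.0/24 to any port 9100 proto tcp comment 'node-exporter'
```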
Helm subchart keys — gotcha¶
kube-prometheus-stack is built from subcharts. Each subchart has two key paths — one for the wrapper and one for the actual subchart. Getting these mixed up is a common source of confusion:
```yaml
# WRAPPER — enable/disable and dashboard config
nodeExporter:
  enabled: true

# SUBCHART — resources, affinity, tolerations
prometheus-node-exporter:  # note the hyphen in the key name
  resources:
    requests:
      cpu: 10m
      memory: 32Mi
```
Always verify key names with `helm show values` before committing them to the HelmRelease.
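For example, using the repo URL and chart version pinned above (assuming the helm CLI is installed locally):

```shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm show values prometheus-community/kube-prometheus-stack --version 82.10.1 \
  | grep -n -A3 "prometheus-node-exporter"
```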
Resource requests/limits¶
After a few days of observing actual usage in Grafana, these were set:
| Component | CPU req | CPU lim | Mem req | Mem lim |
|---|---|---|---|---|
| Prometheus | 200m | 1000m | 1024Mi | 2048Mi |
| Grafana | 50m | 200m | 256Mi | 512Mi |
| AlertManager | 10m | 100m | 64Mi | 128Mi |
| node-exporter | 10m | 100m | 32Mi | 64Mi |
| kube-state-metrics | 10m | 100m | 64Mi | 128Mi |
QoS class: Burstable for all — requests < limits, which is appropriate for most workloads.
Grafana admin password via SealedSecret¶
```bash
kubectl create secret generic grafana-admin-secret \
  --namespace monitoring \
  --from-literal=admin-user=admin \
  --from-literal=admin-password=YOUR-PASSWORD \
  --dry-run=client -o yaml | \
kubeseal --format yaml \
  --cert ~/.config/kubeseal/pub-sealed-secrets.pem \
  > apps/base/monitoring/grafana-secret-sealed.yaml
```
Reference in HelmRelease:
```yaml
grafana:
  admin:
    existingSecret: grafana-admin-secret
    userKey: admin-user
    passwordKey: admin-password
```
TLS certificate for Grafana¶
```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: grafana-tls
  namespace: monitoring  # must be in same namespace as the Ingress
spec:
  secretName: grafana-tls  # name used in Ingress tls.secretName
  dnsNames:
    - grafana.cluster.kcn333.com
  issuerRef:
    name: letsencrypt-prod-cluster-issuer
    kind: ClusterIssuer
```
Useful Grafana dashboards¶
| Dashboard | What it shows |
|---|---|
| Kubernetes / Compute Resources / Cluster | CPU/RAM across the whole cluster |
| Node Exporter / Nodes | Per-node: CPU, RAM, disk, network |
| Kubernetes / Compute Resources / Namespace | Usage per namespace |
| Kubernetes / Compute Resources / Pod | Usage per pod |
AlertManager¶
Configuration as SealedSecret¶
The AlertManager config Secret must be named exactly alertmanager-kube-prometheus-stack-alertmanager — kube-prometheus-stack auto-detects it by name.
```bash
cat <<EOF > /tmp/alertmanager.yaml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: ntfy
  routes:
    - receiver: "null"  # silence noisy startup alerts
      matchers:
        - alertname = "InfoInhibitor"
receivers:
  - name: ntfy
    webhook_configs:
      - url: 'http://ntfy-webhook:8080'
        send_resolved: true
  - name: "null"
EOF

kubectl create secret generic alertmanager-kube-prometheus-stack-alertmanager \
  --namespace monitoring \
  --from-file=alertmanager.yaml=/tmp/alertmanager.yaml \
  --dry-run=client -o yaml | \
kubeseal --format yaml \
  --cert ~/.config/kubeseal/pub-sealed-secrets.pem \
  > apps/base/monitoring/alertmanager-config-sealed.yaml

rm /tmp/alertmanager.yaml
```
Route ordering matters! AlertManager uses the first matching route. The null receiver must come before the default ntfy route.
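Route ordering and matcher syntax can be checked locally before sealing, assuming the amtool binary that ships with AlertManager is available:

```shell
# Validates the config and prints the parsed routing tree
amtool check-config /tmp/alertmanager.yaml
```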
ntfy webhook adapter¶
AlertManager sends JSON payloads, but ntfy expects plain text with HTTP headers. There's no native integration — a small Python adapter runs as a Deployment and translates between the two.
Key implementation details:
- User-Agent: Mozilla/5.0 header is required when ntfy is behind Cloudflare Tunnel (returns HTTP 1010 without it)
- python -u in the container command disables stdout buffering (required for k8s logs)
- HTTPServer.allow_reuse_address = True prevents Address already in use on pod restart
- The URL (NTFY_URL) can't use _file in AlertManager's webhook_configs — the whole config must be a SealedSecret
```yaml
# Deployment snippet
command: ["python", "-u", "/app/app.py"]  # -u = unbuffered stdout
env:
  - name: NTFY_URL
    valueFrom:
      secretKeyRef:
        name: alertmanager-ntfy-secret
        key: url
  - name: NTFY_USER
    valueFrom:
      secretKeyRef:
        name: alertmanager-ntfy-secret
        key: username
  - name: NTFY_PASS
    valueFrom:
      secretKeyRef:
        name: alertmanager-ntfy-secret
        key: password
```
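The adapter source itself isn't reproduced here; a minimal sketch of the idea, covering the details above (names like `format_alert` and `WebhookHandler` are illustrative, and basic auth against ntfy is omitted — the real adapter would also use `NTFY_USER`/`NTFY_PASS`). A container entrypoint would call `main()`:

```python
import json
import os
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def format_alert(payload: dict) -> str:
    """Flatten an AlertManager webhook payload into one line of plain text per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        name = alert.get("labels", {}).get("alertname", "unknown")
        status = alert.get("status", "firing")
        summary = alert.get("annotations", {}).get("summary", "")
        lines.append(f"[{status}] {name}: {summary}")
    return "\n".join(lines)


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        text = format_alert(json.loads(body))
        req = urllib.request.Request(
            os.environ["NTFY_URL"],
            data=text.encode(),
            # Cloudflare Tunnel rejects Python's default User-Agent (HTTP 1010)
            headers={"User-Agent": "Mozilla/5.0"},
        )
        urllib.request.urlopen(req)
        self.send_response(200)
        self.end_headers()


def main():
    # Reuse the address so a restarted pod can rebind port 8080 immediately
    HTTPServer.allow_reuse_address = True
    HTTPServer(("", 8080), WebhookHandler).serve_forever()
```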
Custom PrometheusRule¶
Custom alerts as a CRD, managed by Flux. The release: kube-prometheus-stack label is mandatory — Prometheus ignores PrometheusRules without it.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: homelab-custom-rules
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # required!
spec:
  groups:
    - name: homelab.nodes
      interval: 1m
      rules:
        - alert: NodeHighCPU
          expr: |
            (1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100 > 90
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High CPU on {{ $labels.instance }}"
            description: "CPU usage is {{ $value | humanize }}% (threshold 90%)"
        - alert: NodeHighMemory
          expr: |
            (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High memory on {{ $labels.instance }}"
        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total[10m]) >= 5
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: "Pod crash-looping: {{ $labels.pod }}"
```
Alert states¶
for: 0m means fire immediately when the expression is true — useful for critical alerts where every second matters.
Testing alerts end-to-end¶
```bash
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &
sleep 2
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {"alertname": "Test", "severity": "critical", "namespace": "monitoring"},
    "annotations": {"summary": "Test", "description": "Does it work?"},
    "generatorURL": "http://localhost"
  }]'
```
Loki + Promtail¶
What it does¶
Loki is a log aggregation system. Promtail is a log collector that runs as a DaemonSet on each node, scraping pod logs and shipping them to Loki. All logs are then queryable in Grafana alongside metrics.
Architecture¶
```
All pods → stdout/stderr
    │
    ▼  (Promtail DaemonSet reads /var/log/pods/)
Promtail
    │
    ▼
Loki (SingleBinary mode) ──▶ S3 (Garage) for chunk storage
    │
    ▼
Grafana Explore
```
Loki installation (SingleBinary mode)¶
SingleBinary = all Loki components in one pod. Perfect for homelab — low resource overhead.
Key values in HelmRelease:
```yaml
deploymentMode: SingleBinary
loki:
  auth_enabled: false
  commonConfig:
    replication_factor: 1
  storage:
    type: s3
    s3:
      endpoint: http://192.168.0.46:3900  # Garage S3
      region: garage
      s3ForcePathStyle: true
      insecure: true
    bucketNames:
      chunks: loki-logs
      ruler: loki-logs  # required! causes an error if missing
      admin: loki-logs
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  limits_config:
    retention_period: 7d
  compactor:
    retention_enabled: true
    delete_request_store: s3  # required when retention_enabled: true!
singleBinary:
  replicas: 1
# Disable for homelab — not enough RAM
chunksCache:
  enabled: false
resultsCache:
  enabled: false
# Disable unused components in SingleBinary mode
read:
  replicas: 0
write:
  replicas: 0
backend:
  replicas: 0
gateway:
  enabled: false
test:
  enabled: false
lokiCanary:
  enabled: false
```
Common Loki errors¶
| Error | Fix |
|---|---|
| `Please define loki.storage.bucketNames.ruler` | Add `bucketNames.ruler: loki-logs` |
| `compactor.delete-request-store should be configured` | Add `delete_request_store: s3` |
| OOM / pod restarts | Disable `chunksCache` and `resultsCache` |
Promtail config¶
```yaml
# HelmRelease values
config:
  clients:
    - url: http://loki.loki.svc.cluster.local:3100/loki/api/v1/push
```
Loki as Grafana datasource¶
URL: http://loki.loki.svc.cluster.local:3100
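Instead of clicking it together in the UI, the datasource can also be provisioned declaratively through the kube-prometheus-stack HelmRelease (a sketch; `additionalDataSources` is a key of the chart's Grafana wrapper):

```yaml
grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      url: http://loki.loki.svc.cluster.local:3100
      access: proxy
```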
LogQL basics¶
```
# Filter by namespace
{namespace="kube-system"}

# Filter by pod name (note: .* not just *)
{namespace="clients", pod=~"clients-api.*"}

# Filter by content
{namespace="clients"} |= "request"

# Exclude specific path
{namespace="clients"} |= "request" != "/actuator"

# Case-insensitive match
{namespace="clients"} |~ "(?i)error"
```
Common LogQL mistake: {pod=~"clients-api*"} matches clients-ap, clients-api, clients-apii, etc. The * in regex means "zero or more of the preceding character". Use clients-api.* for "clients-api followed by anything".
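The difference is easy to demonstrate with any regex engine; a quick check in Python (illustrative only — LogQL's `=~` matcher is fully anchored, which `fullmatch` emulates here):

```python
import re

# LogQL label matchers are anchored regexes, so emulate with fullmatch
wrong = re.compile(r"clients-api*")   # * applies only to the final "i"
right = re.compile(r"clients-api.*")  # .* means "any trailing characters"

assert wrong.fullmatch("clients-ap")            # matches: zero "i"s
assert wrong.fullmatch("clients-apiii")         # matches: many "i"s
assert not wrong.fullmatch("clients-api-7f9d")  # "-7f9d" is not more "i"s

assert right.fullmatch("clients-api-7f9d")      # matches any suffix
assert not right.fullmatch("clients-ap")        # the literal "clients-api" is required
```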
Hubble UI (WIP)¶
What it is¶
Hubble is Cilium's built-in network observability layer. It provides a real-time view of all network flows in the cluster — which pods are talking to which, what's being allowed/dropped, latency, etc.
Architecture¶
```
Cilium Agent (per node) → port 4244 (gRPC)
        ↑
Hubble Relay (collects flows from all agents)
        ↑
Hubble UI → https://hubble.cluster.kcn333.com
```
Why it doesn't work in VXLAN mode¶
The relay pod has an IP from the pod pool (10.0.x.x). It tries to connect to Cilium agents on node IPs (192.168.55.x:4244). In VXLAN mode, Cilium doesn't route pod→nodeIP traffic — packets disappear in the BPF datapath before reaching the physical interface.
Verified with tcpdump: SYN packets visible on the veth interface but never reaching enp1s0.
Fix requires native routing:
```yaml
kubeProxyReplacement: true
routingMode: native
autoDirectNodeRoutes: true
ipv4NativeRoutingCIDR: "10.0.0.0/8"
```
Combined with --disable-kube-proxy in k3s.service.
⚠️ This migration must be done during a full cluster restart — not as a rolling DaemonSet update. Mixed VXLAN/native routing breaks network connectivity entirely.
Current Cilium config (VXLAN, Hubble enabled but relay broken)¶
```yaml
values:
  k8sServiceHost: 192.168.55.10
  k8sServicePort: 6443
  operator:
    replicas: 1
  hubble:
    enabled: true
    tls:
      auto:
        enabled: true
        method: helm
    relay:
      enabled: true
    ui:
      enabled: true
      ingress:
        enabled: true
        ingressClassName: traefik
        hosts:
          - hubble.cluster.kcn333.com
        tls:
          - secretName: hubble-tls
            hosts:
              - hubble.cluster.kcn333.com
```
The UI and cert are deployed; relay is the blocker.
Lessons from the debugging marathon¶
- Check network connectivity before assuming TLS issues — `openssl s_client` from node to node showed TLS was fine; the packets just never arrived
- Switching to `kubeProxyReplacement: true` without also setting `routingMode: native` causes instability — Cilium tries to take over services but doesn't have full native routing
- `--disable-kube-proxy` in k3s only makes sense when Cilium is in full `kubeProxyReplacement: true` mode
- Rolling update of the Cilium DaemonSet is not safe for routing mode changes — one node ends up with a different mode, traffic breaks
Useful Commands¶
```bash
# Prometheus targets
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 &
# → http://localhost:9090/targets

# AlertManager status
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &
curl -s http://localhost:9093/api/v2/status | python3 -m json.tool | grep -A5 "routes"

# Check active alerts
curl -s http://localhost:9093/api/v2/alerts | python3 -m json.tool | grep -E "alertname|severity|state"

# PrometheusRules
kubectl get prometheusrule -A
kubectl describe prometheusrule homelab-custom-rules -n monitoring

# Loki query via CLI
kubectl port-forward -n loki svc/loki 3100:3100 &
curl -s "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={namespace="kube-system"}' | python3 -m json.tool | head -50

# Hubble / Cilium
kubectl -n kube-system exec ds/cilium -- cilium status | grep -E "Hubble|Cluster health"
kubectl logs -n kube-system deployment/hubble-relay --tail=20

# Flux
flux get helmrelease kube-prometheus-stack -n monitoring
flux reconcile helmrelease kube-prometheus-stack -n monitoring
```