Observability

kube-prometheus-stack

What it is

A single Helm chart that installs the complete monitoring stack:

| Component | Role |
|---|---|
| Prometheus | Time-series metrics collection and storage |
| Grafana | Visualization dashboards |
| AlertManager | Alert routing and notifications |
| node-exporter | Per-node system metrics (DaemonSet) |
| kube-state-metrics | Kubernetes object state metrics |

Installation via Flux

# HelmRepository
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: prometheus-community
  namespace: flux-system
spec:
  interval: 1h
  url: https://prometheus-community.github.io/helm-charts
# HelmRelease (key values)
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: kube-prometheus-stack
  namespace: monitoring
spec:
  interval: 30m
  chart:
    spec:
      chart: kube-prometheus-stack
      version: "82.10.1"
      sourceRef:
        kind: HelmRepository
        name: prometheus-community
        namespace: flux-system
  install:
    createNamespace: true
  values:
    # Disable components k3s doesn't expose in standard form
    kubeControllerManager:
      enabled: false
    kubeScheduler:
      enabled: false
    kubeEtcd:
      enabled: false
    kubeProxy:
      enabled: false   # Cilium replaced kube-proxy

    kubelet:
      enabled: true
      serviceMonitor:
        https: true
        insecureSkipVerify: true
        cAdvisor: true

    grafana:
      enabled: true
      admin:
        existingSecret: grafana-admin-secret
        userKey: admin-user
        passwordKey: admin-password
      ingress:
        enabled: true
        ingressClassName: traefik
        annotations:
          traefik.ingress.kubernetes.io/ssl-redirect: "true"
        hosts:
          - grafana.cluster.kcn333.com
        tls:
          - secretName: grafana-tls
            hosts:
              - grafana.cluster.kcn333.com

    prometheus:
      prometheusSpec:
        hostNetwork: true    # critical for k3s — see below
        hostPID: true
        retention: 7d
        storageSpec:
          volumeClaimTemplate:
            spec:
              storageClassName: longhorn
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 10Gi
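
As a sanity check on the 10Gi request, Prometheus's rule-of-thumb sizing formula (retention × ingested samples/s × bytes/sample) can be run on paper. The ingest rate below is a hypothetical homelab figure, not measured from this cluster:

```python
# Back-of-envelope TSDB sizing: needed_disk ≈ retention_seconds *
# ingested_samples_per_second * bytes_per_sample (Prometheus docs cite
# ~1-2 bytes per sample after compression).
# ASSUMPTION: ~10k active series scraped every 30s is a hypothetical rate;
# measure yours with rate(prometheus_tsdb_head_samples_appended_total[5m]).
retention_seconds = 7 * 24 * 3600        # retention: 7d
samples_per_second = 10_000 / 30         # active series / scrape interval
bytes_per_sample = 2                     # pessimistic end of the 1-2 byte range

needed_gib = retention_seconds * samples_per_second * bytes_per_sample / 1024**3
print(f"~{needed_gib:.2f} GiB of 10 GiB requested")   # well under the request
```

Even at several times this rate, 10Gi leaves comfortable headroom for 7d retention.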

Why disabled components?

k3s embeds some control-plane components and doesn't expose them on standard ports. Trying to scrape them causes persistent errors in Prometheus:

| Component | Why disabled |
|---|---|
| kubeControllerManager | k3s embeds it, non-standard endpoint |
| kubeScheduler | Same |
| kubeEtcd | Embedded etcd, different port |
| kubeProxy | Replaced by Cilium eBPF |

Problem: no data in Grafana — Cilium blocking pod→nodeIP traffic

Symptom: "No data" on Kubernetes Compute Resources dashboards for CPU/Memory.

Root cause: Cilium in VXLAN mode doesn't route traffic from pods to node IPs (192.168.55.x). Prometheus running as a regular pod couldn't reach kubelet on port 10250 or node-exporter on port 9100 on other nodes.

Diagnosis:

# From Prometheus pod — this failed
kubectl -n monitoring exec -it prometheus-kube-prometheus-stack-prometheus-0 -- \
  wget -qO- --no-check-certificate https://192.168.55.10:10250/healthz

# Internet worked fine — confirming the issue is node IP routing, not network in general
kubectl -n monitoring exec -it prometheus-kube-prometheus-stack-prometheus-0 -- \
  wget -qO- https://8.8.8.8 | head -3

The wrong approach (tried, don't do this):

# Don't — this opens too much
hostServices:
  enabled: true
autoDirectNodeRoutes: true
bpf:
  hostLegacyRouting: true

The right approach — hostNetwork: true for Prometheus:

prometheus:
  prometheusSpec:
    hostNetwork: true
    hostPID: true

Prometheus runs on the host network and has direct access to kubelet (:10250) and node-exporter (:9100) on every node. No Cilium config changes are needed; this is the documented approach for k3s + kube-prometheus-stack.

Problem: UFW blocking port 9100 between nodes

Symptom: Only 1 node visible in the "Nodes" Grafana dashboard (the one where Prometheus was scheduled).

With hostNetwork: true, Prometheus egresses with the node's own IP (192.168.55.x) when scraping the other nodes — and UFW on those nodes had no rule allowing port 9100.

Fix:

ansible all -m shell -a "ufw allow from 192.168.55.0/24 to any port 9100" -b

Add permanently to Ansible playbook:

- { port: '9100', proto: 'tcp' }   # node-exporter
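
A quick way to confirm this kind of firewall drop from a node (or a hostNetwork pod) is a plain TCP connect test — a timeout or refusal points at the firewall rather than the exporter. The node IPs other than .10 below are hypothetical:

```python
# Minimal TCP reachability probe for node-exporter's port across nodes.
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:            # refused, unreachable, or timed out
        return False

if __name__ == "__main__":
    # ASSUMPTION: hypothetical node IPs in the 192.168.55.0/24 subnet
    for node in ("192.168.55.10", "192.168.55.11"):
        print(node, "9100 open:", port_open(node, 9100, timeout=1.0))
```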

Helm subchart keys — gotcha

kube-prometheus-stack is built from subcharts. Each subchart has two key paths — one for the wrapper and one for the actual subchart. Getting these mixed up is a common source of confusion:

# WRAPPER — enable/disable and dashboard config
nodeExporter:
  enabled: true

# SUBCHART — resources, affinity, tolerations
prometheus-node-exporter:    # note the hyphen in the key name
  resources:
    requests:
      cpu: 10m
      memory: 32Mi

Always verify key names with helm show values:

helm show values prometheus-community/kube-prometheus-stack | grep "prometheus-node-exporter"

Resource requests/limits

After a few days of observing actual usage in Grafana, these were set:

| Component | CPU req | CPU lim | Mem req | Mem lim |
|---|---|---|---|---|
| Prometheus | 200m | 1000m | 1024Mi | 2048Mi |
| Grafana | 50m | 200m | 256Mi | 512Mi |
| AlertManager | 10m | 100m | 64Mi | 128Mi |
| node-exporter | 10m | 100m | 32Mi | 64Mi |
| kube-state-metrics | 10m | 100m | 64Mi | 128Mi |

QoS class: Burstable for all — requests < limits, which is appropriate for most workloads.
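
A minimal sketch of the kubelet's QoS rules (simplified to one container and cpu/memory only — the real rules consider every container in the pod) shows why these settings land in Burstable:

```python
# Simplified model of Kubernetes QoS assignment:
#   Guaranteed: requests == limits for both cpu and memory
#   BestEffort: no requests or limits at all
#   Burstable:  everything in between (e.g. requests < limits)
def qos_class(requests: dict, limits: dict) -> str:
    if not requests and not limits:
        return "BestEffort"
    if requests and requests == limits and set(requests) == {"cpu", "memory"}:
        return "Guaranteed"
    return "Burstable"

print(qos_class({"cpu": "200m", "memory": "1024Mi"},
                {"cpu": "1000m", "memory": "2048Mi"}))  # Burstable
print(qos_class({"cpu": "100m", "memory": "64Mi"},
                {"cpu": "100m", "memory": "64Mi"}))     # Guaranteed
print(qos_class({}, {}))                                # BestEffort
```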

Grafana admin password via SealedSecret

kubectl create secret generic grafana-admin-secret \
  --namespace monitoring \
  --from-literal=admin-user=admin \
  --from-literal=admin-password=YOUR-PASSWORD \
  --dry-run=client -o yaml | \
kubeseal --format yaml \
  --cert ~/.config/kubeseal/pub-sealed-secrets.pem \
  > apps/base/monitoring/grafana-secret-sealed.yaml

Reference in HelmRelease:

grafana:
  admin:
    existingSecret: grafana-admin-secret
    userKey: admin-user
    passwordKey: admin-password

TLS certificate for Grafana

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: grafana-tls
  namespace: monitoring      # must be in same namespace as Ingress
spec:
  secretName: grafana-tls   # name used in Ingress tls.secretName
  dnsNames:
    - grafana.cluster.kcn333.com
  issuerRef:
    name: letsencrypt-prod-cluster-issuer
    kind: ClusterIssuer

Useful Grafana dashboards

| Dashboard | What it shows |
|---|---|
| Kubernetes / Compute Resources / Cluster | CPU/RAM across the whole cluster |
| Node Exporter / Nodes | Per-node CPU, RAM, disk, network |
| Kubernetes / Compute Resources / Namespace | Usage per namespace |
| Kubernetes / Compute Resources / Pod | Usage per pod |

AlertManager

Configuration as SealedSecret

The AlertManager config Secret must be named exactly alertmanager-kube-prometheus-stack-alertmanager — kube-prometheus-stack auto-detects it by name.

cat <<EOF > /tmp/alertmanager.yaml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: ntfy
  routes:
    - receiver: "null"       # silence noisy startup alerts
      matchers:
        - alertname = "InfoInhibitor"
receivers:
  - name: ntfy
    webhook_configs:
      - url: 'http://ntfy-webhook:8080'
        send_resolved: true
  - name: "null"
EOF

kubectl create secret generic alertmanager-kube-prometheus-stack-alertmanager \
  --namespace monitoring \
  --from-file=alertmanager.yaml=/tmp/alertmanager.yaml \
  --dry-run=client -o yaml | \
kubeseal --format yaml \
  --cert ~/.config/kubeseal/pub-sealed-secrets.pem \
  > apps/base/monitoring/alertmanager-config-sealed.yaml

rm /tmp/alertmanager.yaml

Route ordering matters! AlertManager uses the first matching route. The null receiver must come before the default ntfy route.
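
The first-match behaviour can be illustrated with a toy matcher (not AlertManager's code — real routes also support nesting, regex matchers, and continue: true, which lets evaluation fall through to later siblings):

```python
# Toy model of AlertManager route selection: the first child route whose
# matchers all hold wins; otherwise the top-level (default) receiver is used.
def pick_receiver(alert_labels: dict, default_receiver: str, routes: list) -> str:
    for route in routes:
        if all(alert_labels.get(k) == v for k, v in route["matchers"].items()):
            return route["receiver"]    # first matching route wins
    return default_receiver            # fall back to the default receiver

routes = [
    {"receiver": "null", "matchers": {"alertname": "InfoInhibitor"}},
]

print(pick_receiver({"alertname": "InfoInhibitor"}, "ntfy", routes))  # null
print(pick_receiver({"alertname": "NodeHighCPU"}, "ntfy", routes))    # ntfy
```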

ntfy webhook adapter

AlertManager sends JSON payloads, but ntfy expects plain text with HTTP headers. There's no native integration — a small Python adapter runs as a Deployment and translates between the two.

Key implementation details:

  • The User-Agent: Mozilla/5.0 header is required when ntfy is behind Cloudflare Tunnel (it returns HTTP 1010 without it)
  • python -u in the container command disables stdout buffering (required for k8s logs)
  • HTTPServer.allow_reuse_address = True prevents Address already in use on pod restart
  • The URL (NTFY_URL) can't use _file in AlertManager's webhook_configs — the whole config must be a SealedSecret
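
The adapter fits in a few dozen lines. This is an illustrative sketch, not the exact app.py from the repo — function names and the message format are assumptions, and basic auth from NTFY_USER/NTFY_PASS is omitted for brevity:

```python
# Illustrative AlertManager -> ntfy adapter (NOT the exact app.py).
import json
import os
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

NTFY_URL = os.environ.get("NTFY_URL", "")   # full topic URL from the Secret

def format_alert(payload: dict) -> tuple[str, str]:
    """Flatten an AlertManager webhook payload into a (title, body) pair."""
    status = payload.get("status", "firing")
    lines = [
        f"[{status}] "
        f"{a.get('labels', {}).get('alertname', 'unknown')}: "
        f"{a.get('annotations', {}).get('summary', '')}"
        for a in payload.get("alerts", [])
    ]
    return f"AlertManager: {status}", "\n".join(lines)

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        title, body = format_alert(json.loads(self.rfile.read(length)))
        req = urllib.request.Request(
            NTFY_URL,
            data=body.encode(),
            # Mozilla UA: ntfy behind Cloudflare Tunnel rejects the default UA
            headers={"Title": title, "User-Agent": "Mozilla/5.0"},
        )
        urllib.request.urlopen(req)
        self.send_response(200)
        self.end_headers()

def serve(port: int = 8080) -> None:
    HTTPServer.allow_reuse_address = True   # avoid "Address already in use" on restart
    HTTPServer(("", port), Handler).serve_forever()
```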

# Deployment snippet
command: ["python", "-u", "/app/app.py"]  # -u = unbuffered stdout
env:
  - name: NTFY_URL
    valueFrom:
      secretKeyRef:
        name: alertmanager-ntfy-secret
        key: url
  - name: NTFY_USER
    valueFrom:
      secretKeyRef:
        name: alertmanager-ntfy-secret
        key: username
  - name: NTFY_PASS
    valueFrom:
      secretKeyRef:
        name: alertmanager-ntfy-secret
        key: password

Custom PrometheusRule

Custom alerts as a CRD, managed by Flux. The release: kube-prometheus-stack label is mandatory — Prometheus ignores PrometheusRules without it.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: homelab-custom-rules
  namespace: monitoring
  labels:
    release: kube-prometheus-stack    # required!
spec:
  groups:
    - name: homelab.nodes
      interval: 1m
      rules:
        - alert: NodeHighCPU
          expr: |
            (1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100 > 90
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High CPU on {{ $labels.instance }}"
            description: "CPU usage is {{ $value | humanize }}% (threshold 90%)"

        - alert: NodeHighMemory
          expr: |
            (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High memory on {{ $labels.instance }}"

        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total[10m]) >= 5
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: "Pod crash-looping: {{ $labels.pod }}"

Alert states

inactive ──(expr true)──▶ pending ──(for: duration elapsed)──▶ firing

for: 0m means fire immediately when the expression is true — useful for critical alerts where every second matters.
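
Rules like these can be unit-tested offline with promtool test rules. A sketch, assuming the spec.groups above have been extracted into a plain rules.yaml:

```yaml
# pod-crashloop_test.yaml — run with: promtool test rules pod-crashloop_test.yaml
# ASSUMPTION: rules.yaml contains the groups: block from the PrometheusRule above.
rule_files:
  - rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'kube_pod_container_status_restarts_total{pod="demo"}'
        values: '0+1x10'        # one restart per minute → increase[10m] ≈ 10
    alert_rule_test:
      - eval_time: 10m
        alertname: PodCrashLooping
        exp_alerts:
          - exp_labels:
              severity: critical
              pod: demo
```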

Testing alerts end-to-end

kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &
sleep 2
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {"alertname": "Test", "severity": "critical", "namespace": "monitoring"},
    "annotations": {"summary": "Test", "description": "Does it work?"},
    "generatorURL": "http://localhost"
  }]'

Loki + Promtail

What it does

Loki is a log aggregation system. Promtail is a log collector that runs as a DaemonSet on each node, scraping pod logs and shipping them to Loki. All logs are then queryable in Grafana alongside metrics.

Architecture

All pods → stdout/stderr
    ▼  (Promtail DaemonSet reads /var/log/pods/)
Promtail
    ▼  (push)
Loki (SingleBinary mode)
    ▼  (chunks + index)
S3 (Garage)

Grafana Explore ← queries Loki

Loki installation (SingleBinary mode)

SingleBinary = all Loki components in one pod. Perfect for homelab — low resource overhead.

Key values in HelmRelease:

deploymentMode: SingleBinary

loki:
  auth_enabled: false
  commonConfig:
    replication_factor: 1
  storage:
    type: s3
    s3:
      endpoint: http://192.168.0.46:3900    # Garage S3
      region: garage
      s3ForcePathStyle: true
      insecure: true
    bucketNames:
      chunks: loki-logs
      ruler: loki-logs      # required! causes error if missing
      admin: loki-logs
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  limits_config:
    retention_period: 7d
  compactor:
    retention_enabled: true
    delete_request_store: s3    # required when retention_enabled: true!

singleBinary:
  replicas: 1

# Disable for homelab — not enough RAM
chunksCache:
  enabled: false
resultsCache:
  enabled: false

# Disable unused components in SingleBinary mode
read:
  replicas: 0
write:
  replicas: 0
backend:
  replicas: 0
gateway:
  enabled: false
test:
  enabled: false
lokiCanary:
  enabled: false

Common Loki errors

| Error | Fix |
|---|---|
| Please define loki.storage.bucketNames.ruler | Add bucketNames.ruler: loki-logs |
| compactor.delete-request-store should be configured | Add delete_request_store: s3 |
| OOM / pod restarts | Disable chunksCache and resultsCache |

Promtail config

# HelmRelease values
config:
  clients:
    - url: http://loki.loki.svc.cluster.local:3100/loki/api/v1/push

Loki as Grafana datasource

URL: http://loki.loki.svc.cluster.local:3100

LogQL basics

# Filter by namespace
{namespace="kube-system"}

# Filter by pod name (note: .* not just *)
{namespace="clients", pod=~"clients-api.*"}

# Filter by content
{namespace="clients"} |= "request"

# Exclude specific path
{namespace="clients"} |= "request" != "/actuator"

# Case-insensitive match
{namespace="clients"} |~ "(?i)error"

Common LogQL mistake: {pod=~"clients-api*"} matches clients-ap, clients-api, clients-apii, etc. The * in regex means "zero or more of the preceding character". Use clients-api.* for "clients-api followed by anything".
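
The difference is easy to verify in Python. Note that LogQL label matchers are fully anchored — the regex must match the whole label value, i.e. re.fullmatch, not re.search:

```python
import re

wrong = "clients-api*"     # 'i*' = zero or more trailing 'i' characters
right = "clients-api.*"    # literal prefix, then anything

for value in ["clients-ap", "clients-api", "clients-apii", "clients-api-7d9f8"]:
    print(value, bool(re.fullmatch(wrong, value)), bool(re.fullmatch(right, value)))
# clients-ap True False
# clients-api True True
# clients-apii True True
# clients-api-7d9f8 False True
```

So against real pod names like clients-api-7d9f8, the wrong pattern matches nothing at all.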


Hubble UI (WIP)

What it is

Hubble is Cilium's built-in network observability layer. It provides a real-time view of all network flows in the cluster — which pods are talking to which, what's being allowed/dropped, latency, etc.

Architecture

Cilium Agent (per node), port 4244 (gRPC)
    ▼
Hubble Relay (collects flows from all agents)
    ▼
Hubble UI → https://hubble.cluster.kcn333.com

Why it doesn't work in VXLAN mode

The relay pod has an IP from the pod pool (10.0.x.x). It tries to connect to Cilium agents on node IPs (192.168.55.x:4244). In VXLAN mode, Cilium doesn't route pod→nodeIP traffic — packets disappear in the BPF datapath before reaching the physical interface.

Verified with tcpdump: SYN packets visible on the veth interface but never reaching enp1s0.

Fix requires native routing:

kubeProxyReplacement: true
routingMode: native
autoDirectNodeRoutes: true
ipv4NativeRoutingCIDR: "10.0.0.0/8"

Combined with --disable-kube-proxy in k3s.service.

⚠️ This migration must be done during a full cluster restart — not as a rolling DaemonSet update. Mixed VXLAN/native routing breaks network connectivity entirely.

Current Cilium config (VXLAN, Hubble enabled but relay broken)

values:
  k8sServiceHost: 192.168.55.10
  k8sServicePort: 6443
  operator:
    replicas: 1
  hubble:
    enabled: true
    tls:
      auto:
        enabled: true
        method: helm
    relay:
      enabled: true
    ui:
      enabled: true
      ingress:
        enabled: true
        ingressClassName: traefik
        hosts:
          - hubble.cluster.kcn333.com
        tls:
          - secretName: hubble-tls
            hosts:
              - hubble.cluster.kcn333.com

The UI and cert are deployed; relay is the blocker.

Lessons from the debugging marathon

  • Check network connectivity before assuming TLS issues — openssl s_client from node to node showed TLS was fine; the packets from pods simply never arrived
  • Switching kubeProxyReplacement: true without also setting routingMode: native causes instability — Cilium tries to take over services but doesn't have full native routing
  • --disable-kube-proxy in k3s only makes sense when Cilium is in full kubeProxyReplacement: true mode
  • Rolling update of the Cilium DaemonSet is not safe for routing mode changes — one node ends up with a different mode, traffic breaks

Useful Commands

# Prometheus targets
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 &
# → http://localhost:9090/targets

# AlertManager status
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &
curl -s http://localhost:9093/api/v2/status | python3 -m json.tool | grep -A5 "routes"

# Check active alerts
curl -s http://localhost:9093/api/v2/alerts | python3 -m json.tool | grep -E "alertname|severity|state"

# PrometheusRules
kubectl get prometheusrule -A
kubectl describe prometheusrule homelab-custom-rules -n monitoring

# Loki query via CLI
kubectl port-forward -n loki svc/loki 3100:3100 &
curl -s "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={namespace="kube-system"}' | python3 -m json.tool | head -50

# Hubble / Cilium
kubectl -n kube-system exec ds/cilium -- cilium status | grep -E "Hubble|Cluster health"
kubectl logs -n kube-system deployment/hubble-relay --tail=20

# Flux
flux get helmrelease kube-prometheus-stack -n monitoring
flux reconcile helmrelease kube-prometheus-stack -n monitoring