Real-time GPU Usage
Real-time monitoring lets you track GPU utilization, memory usage, and resource allocation across your Kubernetes cluster as workloads run. The commands below observe HAMi-managed GPUs using kubectl and standard NVIDIA tooling.
Monitoring with kubectl
Check Node GPU Resources
View the GPU resources currently allocatable on a node:
kubectl get node <node-name> -o json | jq '.status.allocatable' | grep -i gpu
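To compare capacity against allocatable in one view, jq can select both fields. The HAMi resource names you will see (for example nvidia.com/gpu, nvidia.com/gpumem, nvidia.com/gpucores) are typical defaults and depend on how your deployment is configured:
kubectl get node <node-name> -o json | jq '{capacity: .status.capacity, allocatable: .status.allocatable}'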
Inspect Pod GPU Allocation
See which GPUs are allocated to a specific pod:
kubectl get pod <pod-name> -o json | jq '.metadata.annotations' | grep -i gpu
Or view the full pod description, which includes GPU requests, limits, and scheduling events:
kubectl describe pod <pod-name>
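If you only want the HAMi-related annotations, jq can filter them; this assumes the annotation keys are prefixed with hami.io/, which is typical for a default install:
kubectl get pod <pod-name> -o json | jq '.metadata.annotations | with_entries(select(.key | startswith("hami.io/")))'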
Monitoring Inside Containers
Check Allocated GPU Inside Pod
Inside a running container, you can check which GPUs are visible:
kubectl exec -it <pod-name> -- nvidia-smi
This shows the virtual GPU configuration as seen by the container; with HAMi's in-container limiting in place, the reported memory reflects the allocated limit rather than the card's full capacity.
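For a compact check of just the memory numbers, nvidia-smi's standard query flags work as well:
kubectl exec -it <pod-name> -- nvidia-smi --query-gpu=memory.total,memory.used --format=csv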
Watch GPU Usage in Real Time
To monitor GPU usage while a workload runs:
kubectl exec -it <pod-name> -- watch -n 1 nvidia-smi
This updates the GPU metrics every second, showing:
- GPU utilization percentage
- Memory usage and limits
- Running processes
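If the container image does not include the watch utility, a simple shell loop gives the same rolling view (assuming sh is available in the image):
kubectl exec -it <pod-name> -- sh -c 'while true; do nvidia-smi; sleep 1; done'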
Node-Level Monitoring
Monitor All GPUs on a Node
SSH into the node and run:
nvidia-smi
For continuous monitoring:
watch -n 1 nvidia-smi
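nvidia-smi also has a streaming device-monitor mode that prints utilization and memory counters once per second, which is handier for logging than repeated full reports:
nvidia-smi dmon -s um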
Check HAMi Component Status
Verify that the HAMi device plugin and scheduler are running and reporting resources:
kubectl get pods -n kube-system | grep hami
kubectl logs -n kube-system -l app=hami-scheduler -f
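To follow the device plugin logs as well, you can target its DaemonSet directly. The name hami-device-plugin below assumes the default Helm release name and namespace, so adjust it to match your install:
kubectl logs -n kube-system daemonset/hami-device-plugin -f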
Resource Annotation Tracking
HAMi stores GPU information in node annotations. View them with:
kubectl get node <node-name> -o yaml | grep -A 10 "hami.io/node"
This shows detailed GPU information including:
- GPU UUIDs
- Memory capacity
- Compute core count
- Device models
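The per-device details are usually packed into a single registration annotation. Assuming the key hami.io/node-nvidia-register used by recent HAMi releases, it can be read directly with jsonpath:
kubectl get node <node-name> -o jsonpath='{.metadata.annotations.hami\.io/node-nvidia-register}'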
Integration with Monitoring Tools
For production environments, integrate HAMi with tools like:
- Prometheus: Scrape kubelet metrics for GPU resource data
- Grafana: Visualize GPU utilization trends over time
- Kubernetes Dashboard: View GPU resources in the web UI
Refer to the Kubernetes documentation for setting up monitoring with these tools.
Troubleshooting
If you notice GPU allocation inconsistencies, work through the following checks (example commands follow the list):
- Check pod resource requests/limits match HAMi annotations
- Verify the HAMi scheduler is running
- Check device plugin logs for errors
- Ensure nodes have the required GPU labels
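A minimal pass over these checks might look like the following; the gpu=on selector in the last command is the node label HAMi's Helm chart uses by default, so substitute your own label if it differs:
kubectl describe pod <pod-name> | grep -iE "hami|gpu"
kubectl get pods -n kube-system | grep hami-scheduler
kubectl logs -n kube-system daemonset/hami-device-plugin --tail=100
kubectl get nodes -l gpu=on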
For more details, see the troubleshooting guide.