Let's try deploying on a GKE standard cluster.
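For context, here is a rough sketch of what the deployment.yaml being applied looks like, reconstructed from the kubectl describe output below; the ephemeral volume's access mode and storage size aren't shown in that output, so those values are assumptions:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hc-mistral
  namespace: amoeba-workers
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral
  template:
    metadata:
      labels:
        app: mistral
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-a100
        cloud.google.com/gke-spot: "true"
      containers:
      - name: model
        image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/base-cu121.py310
        # Keep the container alive so we can exec into it.
        command: ["/bin/bash", "-c", "--"]
        args: ["while true; do sleep 600; done;"]
        env:
        - name: TRANSFORMERS_CACHE
          value: /scratch/models
        resources:
          requests:
            cpu: "8"
            memory: 64Gi
            ephemeral-storage: 10Gi
            nvidia.com/gpu: "1"
          limits:
            cpu: "8"
            memory: 64Gi
            ephemeral-storage: 10Gi
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: scratch-volume
          mountPath: /scratch
      volumes:
      - name: scratch-volume
        ephemeral:
          volumeClaimTemplate:
            metadata:
              labels:
                type: kaniko-disk
            spec:
              storageClassName: standard-rwo
              accessModes: ["ReadWriteOnce"]  # assumed; not shown in the describe output
              resources:
                requests:
                  storage: 100Gi  # assumed size; not shown in the describe output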
kubectl get deploy
exitCode: 0
stdout:
NAME READY UP-TO-DATE AVAILABLE AGE
hc-mistral 0/1 1 0 3m28s
kubectl describe deploy hc-mistral
exitCode: 0
stdout:
Name: hc-mistral
Namespace: amoeba-workers
CreationTimestamp: Wed, 20 Mar 2024 10:13:30 -0700
Labels: <none>
Annotations: deployment.kubernetes.io/revision: 1
Selector: app=mistral
Replicas: 1 desired | 1 updated | 1 total | 0 available | 1 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app=mistral
Containers:
model:
Image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/base-cu121.py310
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
--
Args:
while true; do sleep 600; done;
Limits:
cpu: 8
ephemeral-storage: 10Gi
memory: 64Gi
nvidia.com/gpu: 1
Requests:
cpu: 8
ephemeral-storage: 10Gi
memory: 64Gi
nvidia.com/gpu: 1
Environment:
TRANSFORMERS_CACHE: /scratch/models
Mounts:
/scratch from scratch-volume (rw)
Volumes:
scratch-volume:
Type: EphemeralVolume (an inline specification for a volume that gets created and deleted with the pod)
StorageClass: standard-rwo
Volume:
Labels: type=kaniko-disk
Annotations: <none>
Capacity:
Access Modes:
VolumeMode: Filesystem
Conditions:
Type Status Reason
---- ------ ------
Available False MinimumReplicasUnavailable
Progressing True ReplicaSetUpdated
OldReplicaSets: <none>
NewReplicaSet: hc-mistral-578688ff77 (1/1 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 3m35s deployment-controller Scaled up replica set hc-mistral-578688ff77 to 1
kubectl get pods
exitCode: 0
stdout:
NAME READY STATUS RESTARTS AGE
build-mistral-hc-5dwr7 0/1 Pending 0 4h11m
hc-mistral-578688ff77-thss2 0/1 Pending 0 5m6s
kubectl describe pods hc-mistral-578688ff77-thss2
exitCode: 0
stdout:
Name: hc-mistral-578688ff77-thss2
Namespace: amoeba-workers
Priority: 0
Service Account: default
Node: <none>
Labels: app=mistral
pod-template-hash=578688ff77
Annotations: cloud.google.com/cluster_autoscaler_unhelpable_since: 2024-03-20T17:14:03+0000
cloud.google.com/cluster_autoscaler_unhelpable_until: Inf
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/hc-mistral-578688ff77
Containers:
model:
Image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/base-cu121.py310
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
--
Args:
while true; do sleep 600; done;
Limits:
cpu: 8
ephemeral-storage: 10Gi
memory: 64Gi
nvidia.com/gpu: 1
Requests:
cpu: 8
ephemeral-storage: 10Gi
memory: 64Gi
nvidia.com/gpu: 1
Environment:
TRANSFORMERS_CACHE: /scratch/models
Mounts:
/scratch from scratch-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v9595 (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
scratch-volume:
Type: EphemeralVolume (an inline specification for a volume that gets created and deleted with the pod)
StorageClass: standard-rwo
Volume:
Labels: type=kaniko-disk
Annotations: <none>
Capacity:
Access Modes:
VolumeMode: Filesystem
kube-api-access-v9595:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Guaranteed
Node-Selectors: cloud.google.com/gke-accelerator=nvidia-tesla-a100
cloud.google.com/gke-spot=true
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 5m21s default-scheduler 0/6 nodes are available: waiting for ephemeral volume controller to create the persistentvolumeclaim "hc-mistral-578688ff77-thss2-scratch-volume". preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling..
Warning FailedScheduling 4m48s (x2 over 5m19s) default-scheduler 0/6 nodes are available: 6 node(s) didn't match Pod's node affinity/selector. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling..
Normal NotTriggerScaleUp 4m48s cluster-autoscaler pod didn't trigger scale-up: 3 node(s) didn't match Pod's node affinity/selector
So it's not scaling up. Do we need to allow spot VMs on the cluster?
Here are the docs for spot VMs
I think the error indicates there is a problem with the node selector on the pod.
I suspect it's an issue with cloud.google.com/gke-spot, but why aren't spot VMs being added to the cluster? None of the existing nodes carry the cloud.google.com/gke-accelerator or cloud.google.com/gke-spot labels, so the scheduler can't place the pod, and the autoscaler reports that none of the existing node pools would match it either.
gcloud container clusters list
exitCode: 0
stdout:
NAME LOCATION MASTER_VERSION MASTER_IP MACHINE_TYPE NODE_VERSION NUM_NODES STATUS
dev us-west1 1.27.8-gke.1067004 104.198.99.216 e2-medium 1.27.8-gke.1067004 5 RUNNING
dev-standard us-west1 1.27.8-gke.1067004 34.168.122.59 e2-medium 1.27.8-gke.1067004 6 RUNNING
- I think we need node auto-provisioning to create new node pools
- Is it enabled?
- Yes, it is enabled, but the node auto-provisioning profile didn't include GPU resources
- So using the UI I edited the profile and added the resource A100 (40 GB); the gcloud equivalent is sketched below
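Something like the following gcloud invocation should be the CLI equivalent of that UI change; the CPU/memory limits here are illustrative placeholders, not values taken from the actual profile:

gcloud container clusters update dev-standard \
  --region us-west1 \
  --enable-autoprovisioning \
  --min-cpu 0 --max-cpu 64 \
  --min-memory 0 --max-memory 512 \
  --min-accelerator type=nvidia-tesla-a100,count=0 \
  --max-accelerator type=nvidia-tesla-a100,count=4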
kubectl delete pods hc-mistral-578688ff77-thss2
exitCode: 0
stdout:
pod "hc-mistral-578688ff77-thss2" deleted
kubectl get jobs
exitCode: 0
stdout:
NAME COMPLETIONS DURATION AGE
build-mistral-hc 0/1 4h24m 4h24m
kubectl delete jobs build-mistral-hc
exitCode: 0
stdout:
job.batch "build-mistral-hc" deleted
kubectl get pods
exitCode: 0
stdout:
NAME READY STATUS RESTARTS AGE
hc-mistral-578688ff77-smjzz 0/1 Pending 0 25s
kubectl describe pods hc-mistral-578688ff77-smjzz
exitCode: 0
stdout:
Name: hc-mistral-578688ff77-smjzz
Namespace: amoeba-workers
Priority: 0
Service Account: default
Node: <none>
Labels: app=mistral
pod-template-hash=578688ff77
Annotations: cloud.google.com/cluster_autoscaler_unhelpable_since: 2024-03-20T17:32:08+0000
cloud.google.com/cluster_autoscaler_unhelpable_until: Inf
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/hc-mistral-578688ff77
Containers:
model:
Image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/base-cu121.py310
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
--
Args:
while true; do sleep 600; done;
Limits:
cpu: 8
ephemeral-storage: 10Gi
memory: 64Gi
nvidia.com/gpu: 1
Requests:
cpu: 8
ephemeral-storage: 10Gi
memory: 64Gi
nvidia.com/gpu: 1
Environment:
TRANSFORMERS_CACHE: /scratch/models
Mounts:
/scratch from scratch-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-twmgw (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
scratch-volume:
Type: EphemeralVolume (an inline specification for a volume that gets created and deleted with the pod)
StorageClass: standard-rwo
Volume:
Labels: type=kaniko-disk
Annotations: <none>
Capacity:
Access Modes:
VolumeMode: Filesystem
kube-api-access-twmgw:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Guaranteed
Node-Selectors: cloud.google.com/gke-accelerator=nvidia-tesla-a100
cloud.google.com/gke-spot=true
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 36s default-scheduler 0/6 nodes are available: waiting for ephemeral volume controller to create the persistentvolumeclaim "hc-mistral-578688ff77-smjzz-scratch-volume". preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling..
Warning FailedScheduling 3s (x2 over 35s) default-scheduler 0/6 nodes are available: 6 node(s) didn't match Pod's node affinity/selector. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling..
Normal NotTriggerScaleUp 3s cluster-autoscaler pod didn't trigger scale-up: 3 node(s) didn't match Pod's node affinity/selector
It's still not triggering scale-up
Here's a link to the node pool scale-up logs
- No scale-up is triggered
- But I think this refers to scaling the existing node pools
Link to all cluster autoscaler logs, based on the logName:
logName="projects/<PROJECT>/logs/container.googleapis.com%2Fcluster-autoscaler-visibility"
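Those entries can also be pulled from the CLI with something like the following (keeping the <PROJECT> placeholder):

gcloud logging read \
  'logName="projects/<PROJECT>/logs/container.googleapis.com%2Fcluster-autoscaler-visibility"' \
  --project <PROJECT> --freshness 1h --limit 20 --format json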
What if we remove the spot VM selector?
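That change is just dropping the cloud.google.com/gke-spot key from the pod template's nodeSelector in deployment.yaml, leaving only:

      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-a100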
kubectl delete -f /Users/jlewi/git_notes/aiengineering/gpuserving/deployment.yaml
exitCode: 0
stdout:
deployment.apps "hc-mistral" deleted
kubectl create -f /Users/jlewi/git_notes/aiengineering/gpuserving/deployment.yaml
exitCode: 0
stdout:
deployment.apps/hc-mistral created
kubectl describe pods
exitCode: 0
stdout:
Name: hc-mistral-5d87f69f67-vhpzh
Namespace: amoeba-workers
Priority: 0
Service Account: default
Node: <none>
Labels: app=mistral
pod-template-hash=5d87f69f67
Annotations: cloud.google.com/cluster_autoscaler_unhelpable_since: 2024-03-20T17:44:25+0000
cloud.google.com/cluster_autoscaler_unhelpable_until: Inf
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/hc-mistral-5d87f69f67
Containers:
model:
Image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/base-cu121.py310
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
--
Args:
while true; do sleep 600; done;
Limits:
cpu: 8
ephemeral-storage: 10Gi
memory: 64Gi
nvidia.com/gpu: 1
Requests:
cpu: 8
ephemeral-storage: 10Gi
memory: 64Gi
nvidia.com/gpu: 1
Environment:
TRANSFORMERS_CACHE: /scratch/models
Mounts:
/scratch from scratch-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-s6c48 (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
scratch-volume:
Type: EphemeralVolume (an inline specification for a volume that gets created and deleted with the pod)
StorageClass: standard-rwo
Volume:
Labels: type=kaniko-disk
Annotations: <none>
Capacity:
Access Modes:
VolumeMode: Filesystem
kube-api-access-s6c48:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Guaranteed
Node-Selectors: cloud.google.com/gke-accelerator=nvidia-tesla-a100
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 51s default-scheduler 0/6 nodes are available: waiting for ephemeral volume controller to create the persistentvolumeclaim "hc-mistral-5d87f69f67-vhpzh-scratch-volume". preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling..
Warning FailedScheduling 19s (x2 over 49s) default-scheduler 0/6 nodes are available: 6 node(s) didn't match Pod's node affinity/selector. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling..
Normal NotTriggerScaleUp 19s cluster-autoscaler pod didn't trigger scale-up: 3 node(s) didn't match Pod's node affinity/selector
gcloud compute accelerator-types list --filter="zone:( us-west1-a us-west1-b us-west1-c )"
exitCode: 0
stdout:
NAME ZONE DESCRIPTION
nvidia-h100-80gb us-west1-a NVIDIA H100 80GB
nvidia-l4 us-west1-a NVIDIA L4
nvidia-l4-vws us-west1-a NVIDIA L4 Virtual Workstation
nvidia-tesla-p100 us-west1-a NVIDIA Tesla P100
nvidia-tesla-p100-vws us-west1-a NVIDIA Tesla P100 Virtual Workstation
nvidia-tesla-t4 us-west1-a NVIDIA T4
nvidia-tesla-t4-vws us-west1-a NVIDIA Tesla T4 Virtual Workstation
nvidia-tesla-v100 us-west1-a NVIDIA V100
nvidia-l4 us-west1-b NVIDIA L4
nvidia-l4-vws us-west1-b NVIDIA L4 Virtual Workstation
nvidia-tesla-a100 us-west1-b NVIDIA A100 40GB
nvidia-tesla-k80 us-west1-b NVIDIA Tesla K80
nvidia-tesla-p100 us-west1-b NVIDIA Tesla P100
nvidia-tesla-p100-vws us-west1-b NVIDIA Tesla P100 Virtual Workstation
nvidia-tesla-t4 us-west1-b NVIDIA T4
nvidia-tesla-t4-vws us-west1-b NVIDIA Tesla T4 Virtual Workstation
nvidia-tesla-v100 us-west1-b NVIDIA V100
ct5lp us-west1-c ct5lp
nvidia-l4 us-west1-c NVIDIA L4
nvidia-l4-vws us-west1-c NVIDIA L4 Virtual Workstation
So A100s are only available in us-west1-b
- Could that be why it's not scaling up, because the accelerator isn't available in all zones?
- Let's try changing the selector to nvidia-l4 since that's available in all zones
kubectl delete -f /Users/jlewi/git_notes/aiengineering/gpuserving/deployment.yaml
exitCode: 0
stdout:
deployment.apps "hc-mistral" deleted
kubectl apply -f /Users/jlewi/git_notes/aiengineering/gpuserving/deployment.yaml
exitCode: 0
stdout:
deployment.apps/hc-mistral created
kubectl describe pods
exitCode: 0
stdout:
Name: hc-mistral-855749bbff-8nwn4
Namespace: amoeba-workers
Priority: 0
Service Account: default
Node: <none>
Labels: app=mistral
pod-template-hash=855749bbff
Annotations: cloud.google.com/cluster_autoscaler_unhelpable_since: 2024-03-20T17:51:17+0000
cloud.google.com/cluster_autoscaler_unhelpable_until: Inf
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/hc-mistral-855749bbff
Containers:
model:
Image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/base-cu121.py310
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
--
Args:
while true; do sleep 600; done;
Limits:
cpu: 8
ephemeral-storage: 10Gi
memory: 64Gi
nvidia.com/gpu: 1
Requests:
cpu: 8
ephemeral-storage: 10Gi
memory: 64Gi
nvidia.com/gpu: 1
Environment:
TRANSFORMERS_CACHE: /scratch/models
Mounts:
/scratch from scratch-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8qxf4 (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
scratch-volume:
Type: EphemeralVolume (an inline specification for a volume that gets created and deleted with the pod)
StorageClass: standard-rwo
Volume:
Labels: type=kaniko-disk
Annotations: <none>
Capacity:
Access Modes:
VolumeMode: Filesystem
kube-api-access-8qxf4:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Guaranteed
Node-Selectors: cloud.google.com/gke-accelerator=nvidia-l4
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 38s default-scheduler 0/6 nodes are available: waiting for ephemeral volume controller to create the persistentvolumeclaim "hc-mistral-855749bbff-8nwn4-scratch-volume". preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling..
Warning FailedScheduling 5s (x2 over 36s) default-scheduler 0/6 nodes are available: 6 node(s) didn't match Pod's node affinity/selector. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling..
Normal NotTriggerScaleUp 5s cluster-autoscaler pod didn't trigger scale-up: 3 node(s) didn't match Pod's node affinity/selector
- I tried to add a node pool through the UI and I got a permission error
- Docs for creating a node pool
gcloud container node-pools create a100 --accelerator type=nvidia-tesla-a100,count=1,gpu-driver-version=latest --machine-type=a2-highgpu-1g --region us-west1 --cluster dev-standard --node-locations us-west1-b --min-nodes 0 --max-nodes 3 --enable-autoscaling
exitCode: 1
stderr:
Default change: During creation of nodepools or autoscaling configuration changes for cluster versions greater than 1.24.1-gke.800 a default location policy is applied. For Spot and PVM it defaults to ANY, and for all other VM kinds a BALANCED policy is used. To change the default values use the `--location-policy` flag.
Note: Machines with GPUs have certain limitations which may affect your workflow. Learn more at https://cloud.google.com/kubernetes-engine/docs/how-to/gpus
ERROR: (gcloud.container.node-pools.create) ResponseError: code=400, message=The user does not have access to service account "887891891186-compute@developer.gserviceaccount.com". Ask a project owner to grant you the iam.serviceAccountUser role on the service account.
- I fixed the permissions through the UI
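For reference, the CLI equivalent of that permission fix is roughly the following, granting roles/iam.serviceAccountUser on the default compute service account named in the error; the --member value is a placeholder for the actual account:

gcloud iam service-accounts add-iam-policy-binding \
  887891891186-compute@developer.gserviceaccount.com \
  --member="user:<YOUR_ACCOUNT>" \
  --role="roles/iam.serviceAccountUser"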
gcloud container node-pools create a100 --accelerator type=nvidia-tesla-a100,count=1,gpu-driver-version=latest --machine-type=a2-highgpu-1g --region us-west1 --cluster dev-standard --node-locations us-west1-b --min-nodes 0 --max-nodes 3 --enable-autoscaling
- It looks like that is working, but we need to add the --spot flag to request spot VMs
gcloud container node-pools create a100-spot --spot --accelerator type=nvidia-tesla-a100,count=1,gpu-driver-version=latest --machine-type=a2-highgpu-1g --region us-west1 --cluster dev-standard --node-locations us-west1-b --min-nodes 0 --max-nodes 3 --enable-autoscaling
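To confirm the new pools were created with autoscaling, and to see which nodes (if any) carry the accelerator and spot labels:

gcloud container node-pools list --cluster dev-standard --region us-west1
kubectl get nodes -L cloud.google.com/gke-accelerator,cloud.google.com/gke-spot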
Check the Image
kubectl get pods
exitCode: 0
stdout:
NAME READY STATUS RESTARTS AGE
hc-mistral-578688ff77-2lx2n 1/1 Running 0 5m39s
Use kubectl exec to run nvidia-smi inside the pod
kubectl exec hc-mistral-578688ff77-2lx2n -- /usr/local/nvidia/bin/nvidia-smi
exitCode: 0
stdout:
Wed Mar 20 21:17:24 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB Off | 00000000:00:04.0 Off | 0 |
| N/A 29C P0 43W / 400W | 4MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
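nvidia-smi sees the A100 and the driver is loaded. As one more sanity check, assuming PyTorch is installed in the base-cu121.py310 image (not verified here), we can confirm the framework sees the device too:

kubectl exec hc-mistral-578688ff77-2lx2n -- \
  python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"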