Kubernetes

Kubernetes (K8s): a orchastration tool for containers in a cluster

scale up and down nodes
backup and restore data once failed
workload distribution accross nodes

Because the word "Kubernetes" is hard to spell right, and there are 8 letters "ubernete" between "K" and "s", "K8s" abbreviation is used.

There are many other tools in container ecosystems:

Prometheus is a monitoring tool for clusters.
Dashboard is basic UI, inside cluster, bad, don't use.
Lens is IDE-like dashboard on various OS.
grafana is a dashboard UI.
k9s is a command line dashboard UI
Terraform is a cloud service setup tool to replace graphical interface of each cloud service provider.

Installation

Note that the hostname of a node cannot contain upper case letters to use Kubernetes. Use hostnamectl set-hostname to change hostname.

Add the following lines to /etc/docker/daemon.json:

{
    "insecure-registries" : ["localhost:32000"]
}

and then restart docker with: sudo systemctl restart docker

Install Microk8s

Assuming you have Ubuntu 22.04:

#!/bin/sh

# make sure docker is installed and we have permission
sudo usermod -a -G docker $USER
newgrp docker

# install microk8s to run locally
# stable version of 1.27
sudo snap install microk8s --classic --channel=1.27/stable
# setup user group
sudo usermod -a -G microk8s $USER
# refresh permission group
newgrp microk8s
# give access
sudo chown -f -R $USER ~/.kube
# check installation
microk8s status --wait-ready

# now install kubectl, although microk8s has its own, we will use system's kubectl
sudo apt-get install -y ca-certificates curl apt-transport-https
# get key
sudo mkdir /etc/apt/keyrings/
curl -fsSL https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-archive-keyring.gpg
# add repo
echo "deb [signed-by=/etc/apt/keyrings/kubernetes-archive-keyring.gpg] https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
# install kubectl
sudo apt-get update && sudo apt-get install -y kubectl

# now we need to connect kubectl with microk8s
# this might need to be reset when your network status changes
# by making ~/.kube/config
microk8s config > ~/.kube/config
# should display correctly
kubectl cluster-info

# install helm
curl https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg > /dev/null
sudo apt-get install -y apt-transport-https
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt-get update
sudo apt-get install -y helm

Install K9s

Go to Here and download the k9s_Linux_amd64.tar.gz file, and put the file into /usr/local/bin. That's it.

Basic Concepts

Master Node: manage Cluster Network, can have multiple master nodes

API server (entrypoint): UI, CLI, talk to etcd
Controller Manager: Kubernetes internal states, controls other controllers
- node controller
- replication controller
- endpoint controller
- service account & token controller
Cloud Control Manager: everything has to do with 3rd party cloud service
Kube Scheduler: distribution Pods to Nodes based on
- resource requirement
- hardware, software, policy constraints
- data locality
- affinity, anti-affinity: whether it is required or preferred to put or not put two types of Pods on the same machine
etcd: snapshot database for recovery, hold Status information for meeting Specification
addons: such as DNS, Dashboard, Cluster-level logging, resource monitoring

Worker Nodes: actual work, can have multiple containers in one node

Container Runtime: many implements Kubernetes Container Runtime Interface (CRI)
- Moby
- Containerd
- Cri-o
- Rkt
- Kata
- Virtlet
Kubelet: manage all pods lifecycle in worker node
Kube Proxy: manage network rules
Pods

Starting V1.19, Moby isn't needed as you will use crictl to run command directly in a node. Docker, before version 20.10, has a container runtime, but they started to use containerd as their underlying container runtime for better modularity.

Node Pool: a group of virtual machines of the same size

Application: e.g. docker, database

Cluster IP: virtual IP address in Cluster Network

Pod: an abstraction of containers (so you don't have to always use docker), ephemeral

usually one Application per Pod
IP: each pod will get a new Cluster IP address upon creation
Service: a static Cluster IP address, that is not ephemeral, usually a Service per Application
- External Service: an IP address exposed (goes to Ingress)
- Internal Service: an IP address not exposed (how Pods communicates)
- Service will forward to IP to create load balance

ConfigMap & Secrets: like a per-cluster .env file

Volume: a storage (either local or remote) that is attached to a Pod for consistent data storage like database (because container themselves can't store persistent data)

Kubernetes doesn't manage data persistance

Deployments: blueprint for how to create a kind of Pods at scale.

Don't use Deployments to create Pods that is stateful (e.g. database) and need global consistency. Use StatefulSet instead.

StatefulSet: DB replication, scaling, synchronized read and writes.

In practice, don't use StatefulSet. You probably want to use 3rd party service for that.

Configuration

Master can be accessed through

UI: grafana or dashboard
API: RESTful
CLI: kubectl

Configuration has 3 components

manually specified Kind: e.g. Deployment, Service
manually specified Metadata
automatically generated Specification

We directly use yaml or json configuration to launch Service, Deployment (which creates Pods), Secrets, and ConfigMap.

Use kubectl get all, kubctl get secret, and kubctl get configmap to see all launched Components. Use kubctl get node -o wide to see info of nodes.

You can generate .yaml config by using kubectl --dry-run=client -o yaml

Configurations are hard to write correct, here are some great references:

Docker Containers and Kubernetes Fundamentals – Full Hands-On Course

Learn more about configurations Here

Cluster Simulation

Tools:

K3s: lightweight, small production projects, pre-packed opinion
K3D: fast, preserve cluster state, wrapper of K3s in docker
MicroK8s: easy, quick feature enablement
Minikube: requires Hypervisor (like VirtualBox)
Docker Desktop: only one node, support Mac, Windows, heavy
Kind: simple, basic, can't preserve cluster state, runs on Docker Desktop
K0s

KubeCtl

Context: which cluster you are working with and metadata associated to that context. Information stored in ~/.kube/config

Context Commands

kubectl config current-context # show which context is used currently
kubectl config get-context # show all context avaliable
kubectl config use-context [contextName] # switch to context
kubectx # a short name for "kubectl config"
kubectl config rename-context [oldName] [newName]
kubectl config delete-context [contextName]

Namespace Commands

kubectl get namespace # List all Namespace
kubectl get ns # List all Namespace
kubectl config set-context --current --namespace=[namespaceName] # switch to Namespace, all commands will send to that Namespace
kubectl create namespace [namespaceName]
kubectl delete namespace [namespaceName] # delete Namespace and all related Components
kubectl get pods --all-namespaces # list all pods in all Namespaces
kubectl get pods --namespace=[namespaceName] # list all pods in specific Namespace
kubectl get pods -n [namespaceName] # list all pods in specific Namespace

You can put namespace on any Component under the metadata section (in addition to name), so you can delete them in a batch. (Namespace is on its own a Component)

Labels: self-defined tags for each Components, work with Selectors Selectors: use labels to filter or select Components, used to refer to Components (also you can pass --selector=[name] to get commands to filter stuff)

Pods Commands

kubectl create -f [fileName.yaml] # create a pod
kubectl get pods -o wide # print all pods
kubectl describe pod [podName] # show pod info
kubectl get pod [podName] -o yaml # extract pod definition in yaml
kubectl exec -it [podName] -- sh # launch interactive `sh` shell in a pod
kubectl exec -it [podName] -c [containerName] -- sh # for multi-container pod
kubectl logs [podName] -c [containerName] # getting logs for a container
kubectl delete -f [fileName.yaml] # delete a pod
kubectl delete pod [podName] # delete a pod in 30s
kubectl delete pod [podName] --wait=false # delete a pod with no wait
kubectl delete pod [podName] --grace-period=0 --force # force delete a pod
kubectl port-forward [kind]/[serviceName] [podPort]:[localPort] # ssh forward service

Node Commands

kubectl get nodes # see all Nodes
kubectl describe nodes [nodeName] # see status of a Node

Pods

Pod State:

Pending: accepted, not yet created
Running: bound to a node
Succeeded: exited with status 0
Failed: all containers exit and at least one exited with a non-zero status
Unknown: can't communicate
CrashLoopBackOff: repeatably starting and crashing

Init Containers: executable logics to run before actual app starts to run (install dependencies)

run one after the other, if one failed, restart
will not get probes (livenessProbe, readinessProbe, startupProbe)

Networking

IP Address

Containers in a Pod share IP address and Volume
Different Pods communicate through Service if within Cluster
External communication outside of Cluster goes through LoadBalancer (usually Cloud provider)

Workload

Workload: a Component that does heavy work, here are the inheritage

Pod: basic type of workload
- ReplicaSet: automatic replicate and ensure there are number of Pods running inside ReplicaSet (self-heal, scaling)
  - Deployment: updates, rollback
- StatefulSet: for stateful Pods, need to have Service, very complicated
- DeamonSet: ensure all nodes runs an instance of a Pod (scheduled by scheduler controller, run by daemon controller)
- Job: run to number of successful completion (Pod will remain even when job succeed)
- CronJob: repeat scheduled jobs (history will keep logs of last 3 successful jobs and last 1 failed job)

Since setting up database is hard and inefficient, I would prefer not to setup a 3rd party cloud service. So nothing is covered on this tutorial.

ReplicaSets Commands

kubectl get rs # list ReplicaSets
kubectl describe rs [replicaSetName]
kubectl delete -f [fileName.yaml]
kubectl delete rs [replicaSetName]

Deployment:

replicas: number of pod instances
revisionHistoryLimit: number of previous iterations to keep
strategy.type = RollingUpdate: cycle through to update pods (also see maxSurge, maxUnavailable, default to 25%)
strategy.type = Recreate: kill all pods before creating new ones

Deployment Commands

kubectl get rs # list ReplicaSets
kubectl get deploy # list ReHTTPplicaSets
kubectl describe deploy [deploymentName]
kubectl delete -f [fileName.yaml]
kubectl delete deploy [deploymentName]

kubectl apply -f [fileName.yaml] # to update a deployment (not `create`)
kubectl rolllout status # get the progress of the update
kubectl rollout history deployment [deploymentName] # get history
kubectl rollout undo [deploymentName] # rollback a deployment
kubectl rollout undo [deploymentName] --to-revision=[revisionNumber] # rollback to a revision

DaemonSet Commands

kubectl get ds # list ReplicaSets
kubectl describe ds [daemonSetName]
kubectl delete -f [fileName.yaml]
kubectl delete ds [daemonSetName]

Job Commands

kubectl get job # list ReplicaSets
kubectl describe job [jobName]
kubectl delete -f [fileName.yaml]
kubectl delete job [jobName]

CronJob Commands

kubectl get cj # list ReplicaSets
kubectl describe cj [cronJobName]
kubectl delete -f [fileName.yaml]
kubectl delete cj [cronJobName]

Blue-Green Deployments

Blue: Pods in production v1 Green: Pods in new version v2

Blue-Green Deployment: when everything is ready, switch selector of Service from v1 to v2 (switch blue and green)

downtime is minimal
requires 2x many nodes in cluster

Service

Service: for both external and internal communication

update Service with kubectl apply -f [fileName.yaml] command.
static IP address
static DNS name ([serviceName].[namespace].svc.cluster.local)
types:
- ClusterIP (default): internal, reach by http://[serviceName]:[port], specify port (internal Service port) and targetPort (Pod port)
  - NodePort: internal, external, nodePort (external) statically defined or chosen from 30000-32767, reached by Node's ip.
  - LoadBalancer (TCP transport layer4 Load Balancing): unable to make decision based on content, round-robin
  - Ingress (TCP application layer7 Load Balancing): distinguish HTTP, SMTP routing rules

Service Commands

kubectl apply -f [fileName.yaml] # to deploy a service
kubectl get svc # list ReplicaSets
kubectl describe svc [serviceName]
kubectl delete -f [fileName.yaml]
kubectl delete svc [serviceName]

Volumes

Volumes: cluster-wide storage system outside of cluster (provided by 3rd party service), vendors create plugins according to Container Storage Interface

PersistentVolumeClaim: an interface and abstraction over storage, that specify how much storage a Pod would use at maximum
PersistentVolume: cluster-wide static volumes, only one Persistent Volume Claim is allowed for one Persistent Volume, need to specify capacity
StorageClass: cluster-wide dynamic volumes, multiple Persistent Volume Claim is allowed for one Storage Class, no need to specify capacity

Reclaim Policies:

Delete: delete data upon pods deletion (default)
Retain: keep data upon pods deletion

Access Modes:

ReadWriteMany: all Pods can read and write
ReadOnlyMany: all Pods can only read
ReadWriteOnce: read-write for single Pod (that mount first), and read-only for other

States:

Avaliable: no PersistentVolumeClaim is created for a PersistentVolume or StorageClass
Bound: there are at least one PersistentVolumeClaim is created for a PersistentVolume or StorageClass
Released: PersistentVolumeClaim is deleted, but resource is not yet reclaimed by the cluster
Failed: failed

ConfigMap

ConfigMap: Externalize Environment Configuration as a Component, can create from

Manifests
Files
Directories

To see the changes, containers have to restart and K8s will inject the environmental value. You could mount Volume in ConfigMap to make it non-static, but you need to read from files.

ConfigMap Commands

kubectl apply -f [fileName.yaml] # to deploy a ConfigMap
kubectl create cm [configMapName] --from-file=[fileName.txt] # create imperatively from file
kubectl create cm [configMapName] --from-file=[directory/] # create imperatively from directory
kubectl get cm
kubectl get cm [configMapName] -o YAML # save configMap as yaml file
kubectl delete -f [fileName.yaml]

Secrets

Secrets: stored as base64 encoded string (not secure)

should protect with role-based access control (RBAC)
could store secrets elsewhere
- Azure Key Vault
- AWS Key Management Service
- Google Cloud KMS
- HashiCorp Vault

Probes

Probes: for a Pod to know the status of a application inside container

Types of Probes
- Startup Probes
- Readiness Probes: stop traffic
- Liveness Probes: restart container
Checking Methods
- ExecAction: execute a command inside container
- TCPSocketAction: check if TCP socket port is open
- HTTPGetAction: perform http GET against a port

Probes are not component, but used by Kubelet.

Horizontal Pod Autoscaling

HorizontalPodAutoscaling Commands

kubectl get hpa [horizontalPodAutoscalingName]
kubectl delete hpa [horizontalPodAutoscalingName]
kubectl delete -f [fileName.yaml]

To use it, you need to install metrics-server (and add --kubelet-insdecure-tls if on DockerDesktop)

Kubernetes in Home Cluster

Firstly, make sure you have no swap

sudo swapoff -a
sudo nano /etc/fstab # and remove swap

Then install kubeadm kubelet kubectl kubernetes-cni after you set up repository following this guide on both worker nodes and the master node.

Then run sudo kubeadm init only on master node, then join the worker nodes with the command output.

Then, copy configuration file to $HOME/.kube/config on the master node (see Installation)

If you have problems [ERROR CRI]: container runtime is not running, then run the following

sudo mv /etc/containerd/config.toml /etc/containerd/config.toml.bak
sudo systemctl restart containerd
sudo kubeadm init

In the end, you will get

Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

Alternatively, if you are the root user, you can run:

  export KUBECONFIG=/etc/kubernetes/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 104.171.203.218:6443 --token lzwo36.sjbuvacn3aube42l \
        --discovery-token-ca-cert-hash sha256:b3b3db068180a8a7edf1d3025e77734974af735b421de4396ddd131ce61f4a65

Use the kubeadm join commands on the worker nodes.

Then you can see

❯ kubectl get nodes
NAME              STATUS     ROLES           AGE    VERSION
104-171-203-218   NotReady   control-plane   5m6s   v1.27.4

We need to install Calico for its network security solution on master node:

kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml

Then wait for a while, you should see the nodes are "Ready"

Nginx

You can choose directly deploy

kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/cloud/deploy.yaml

Or use helm

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

# install a new Helm package (chart) of type ingress controller and name it my-nginx-ingress
# Ingress resources will use external IP for routing traffic.
helm install my-nginx-ingress ingress-nginx/ingress-nginx --set controller.publishService.enabled=true

Then verify

kubectl get pods -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --watch
NAME                                        READY   STATUS      RESTARTS   AGE
ingress-nginx-admission-create-bktwj        0/1     Completed   0          15s
ingress-nginx-admission-patch-rmtcg         0/1     Completed   1          15s
ingress-nginx-controller-679bdfb778-zd5x9   0/1     Running     0          15s

Nvidia CUDA GPU

It is kinda difficult to enable GPU on kubernetes for couple reasons:

the image with driver in it is huge (you may find docker system prune -a helpful, tar-ing file with pigz might speedup work but I never get the format to work)
there are many deprecated methods online that you don't know whether they work (such as nvidia-docker, nvidia-docker2 at Here or pytorch-operator)
different node vendor has different things (AMD, Intel, NVIDIA all provide k8s support)
image need to be compatible with node driver
things may depend on nvcc to compile
may not be simulator-agnostic
According to Here, you must use docker engine instead of containerd engine to enable GPU.

Follow Docker

Before proceed to next sections, you need to make sure your Dockerfile is configured correctly. That means docker run -it --entrypoint /bin/bash [image-hash] should allow you to access nvidia-smi and pytorch should acknowledge the GPU. Note that you must use image hash for local images. (Also local images should not be tagged latest, otherwise it will always be pulled from dockerhub)

You can check GPU avaliability by:

import torch
torch.cuda.is_available()

To remove docker container, do docker ps -a, then docker remove [containerId]

Follow `k8s-device-plugin`

Follow k8s-device-plugin instructions.

Make sure which nvidia-container-toolkit and which nvidia-container-runtime exist (you might want to reinstall nvidia-container-toolkit) no matter which solution you are following.

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/libnvidia-container.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

Edit /etc/docker/daemon.json to be: and then sudo systemctl restart docker (or you may choose to run sudo nvidia-ctk runtime configure instead as it might auto configure this file for you?)

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Edit /etc/containerd/config.toml to be: and then sudo systemctl restart containerd

Also do sudo systemctl daemon-reload

version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

You don't need to do kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml, as we choose to follow microk8s' setup. This is because following k8s-device-plugin does not work fully for me.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.1
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

If you do follow above and do a kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml (it will create a deamonset in a different namespace kubectl get namespaces) and look at its log, you will get [factory.go:115] Incompatible platform detected. If you ignore this problem and try to schedule a GPU job, you will get 0/1 nodes are available: 1 Insufficient nvidia.com/gpu

Follow Microk8s

Use microk8s inspect to check microk8s installation.

When you setup everything above, reboot the system, then do microk8s enable gpu, you will get:

Infer repository core for addon gpu
Enabling NVIDIA GPU
Addon core/dns is already enabled
Addon core/helm3 is already enabled
Checking if NVIDIA driver is already installed
GPU 0: NVIDIA RTX A2000 8GB Laptop GPU (UUID: GPU-11886664-6f9b-bb01-311a-573dffe8aee4)
Using host GPU driver
"nvidia" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
NAME: gpu-operator
LAST DEPLOYED: Mon Sep  4 06:29:55 2023
NAMESPACE: gpu-operator-resources
STATUS: deployed
REVISION: 1
TEST SUITE: None
NVIDIA is enabled

The command will also add many pods. You should wait for a long time (like 5min), then do microk8s kubectl logs -n gpu-operator-resources -lapp=nvidia-operator-validator -c nvidia-operator-validator to verify successful configuration. You should get all validations are successful.

After running klzzwxh:0065 — After running `microk8s enable gpu`

Now, when you do kubectl describe node -A | grep nvidia, you should see:

                    nvidia.com/cuda.driver.major=520
                    nvidia.com/cuda.driver.minor=61
                    nvidia.com/cuda.driver.rev=05
                    nvidia.com/cuda.runtime.major=11
                    nvidia.com/cuda.runtime.minor=8
                    nvidia.com/gfd.timestamp=1693823531
                    nvidia.com/gpu.compute.major=8
                    nvidia.com/gpu.compute.minor=6
                    nvidia.com/gpu.count=1
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=true
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=ampere
                    nvidia.com/gpu.machine=21D6004WUS
                    nvidia.com/gpu.memory=8192
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-RTX-A2000-8GB-Laptop-GPU
                    nvidia.com/gpu.replicas=1
                    nvidia.com/mig.capable=false
                    nvidia.com/mig.strategy=single
  nvidia.com/gpu:     1
  nvidia.com/gpu:     1
  gpu-operator-resources      nvidia-container-toolkit-daemonset-kjp55                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         28m
  gpu-operator-resources      nvidia-dcgm-exporter-kd852                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         28m
  gpu-operator-resources      nvidia-device-plugin-daemonset-fq2rk                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         28m
  gpu-operator-resources      nvidia-operator-validator-q9hz9                                0 (0%)        0 (0%)      0 (0%)           0 (0%)         28m
  nvidia.com/gpu     0           0

Summary

#!/bin/sh

# nvidia-container-toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/libnvidia-container.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

# microk8s GPU
microk8s inspect
microk8s enable gpu
sudo nvidia-ctk runtime configure
sudo systemctl restart containerd
sudo systemctl restart docker
sudo systemctl daemon-reload
microk8s kubectl logs -n gpu-operator-resources -lapp=nvidia-operator-validator -c nvidia-operator-validator

# k9s
mkdir k9s
cd k9s
wget https://github.com/derailed/k9s/releases/download/v0.27.4/k9s_Linux_amd64.tar.gz
tar -xzf k9s_Linux_amd64.tar.gz
sudo cp k9s /usr/local/bin/k9s
cd ..
rm -rf k9s

Other Microk8s Instructions

For production and more advanced setup, follow Here

You might get the following error Normal Scheduled 2m21s default-scheduler Successfully assigned ***-77c979f8bf-px4v9 to gke-***-389a7e33-t1hl Warning Failed 20s kubelet Error: context deadline exceeded Normal Pulled 7s (x3 over 2m20s) kubelet Container image "***" already present on machine Warning Failed 7s (x2 over 19s) kubelet Error: failed to reserve container name ***-77c979f8bf-px4v9_***": name "***-77c979f8bf-px4v9_***" is reserved for "818fcfef09165d91ac8c86ed88714bb159a8358c3eca473ec07611a51d72b140"

See Here

To deal with this issue, you might want to increase load time by appending --runtime-request-timeout 30m0s (See all settings Here and instructions Here) to the file /var/snap/microk8s/current/args/kubelet. Then restart microk8s stop, microk8s start

To build and use images locally, you might do the following (don't use :latest tag):

docker save frontend:3.4 > tmp-image.tar
microk8s ctr image import tmp-image.tar

If you find your image in microk8s ctr images ls, then microk8s can load your image.

However, docker save and docker load is often threshold by io, so we build a local registry: But local registry is faster to build, but slower for ContainerCreating. So in development, don't use local registry.

# we need to enable this local registry function before hand
microk8s enable registry:size=20Gi

# to directly build image into local registry
docker build . -t localhost:32000/image_name:image_tag
# or import from existing image
docker tag some_hash_tag localhost:32000/image_name:image_tag

# finally push to registry
docker push localhost:32000/image_name

And then use image: localhost:32000/image_name:image_tag in kubernetes. To check built image, checkout curl http://localhost:32000/v2/_catalog

Pushing to this insecure registry may fail in some versions of Docker unless the daemon is explicitly configured to trust this registry. To address this we need to edit /etc/docker/daemon.json and add, then do sudo systemctl restart docker: { "insecure-registries" : ["localhost:32000"] } (also if docker failed to restart, you might check out dockerd for error messages)

See Here for details.

Cleanup

K8s occupies a lot of disk space. We want to clean up.

To remove all stopped docker containers: docker rm $(docker ps -aq) (usually docker ps only display running containers, -a will include stopped). It will automatically disallow you to remove running containers (or docker rm $(docker ps --filter "status=exited" -q) to exclude dead and created containers). -q means only display container id.

Then do docker image prune to use unused ones if you don't want to remove them using docker image rm manually. To be more aggressive, use docker system prune:

WARNING! This will remove:
  - all stopped containers
  - all networks not used by at least one container
  - all dangling images
  - all dangling build cache

Similar to docker images, you can also do microk8s ctr images ls to see images imported to microk8s.

To remove all images in the microk8s: microk8s ctr images rm $(microk8s ctr images ls name~='localhost:32000' | awk {'print $1'})

Persistent Volume

I am too lazy to explain, but Here is a good video. Note that the volume persists even you delete it. You cannot change sizes of your configuration .yaml between delete and apply, unless your volume is empty.

Memory

Sometimes, application may request /dev/shm instead of actual memory. So you might need to do the following (see Here, Here, and Here)

volumeMounts:

- mountPath: /dev/shm
  name: dshm

...

volumes:

- name: dshm
  emptyDir:
    medium: Memory
    sizeLimit: 32Gi

Expose

To expose ports, you need: microk8s enable ingress. Documentation is Here. But it doesn't work since I am not using standard 80 and 443 for http web requests.

So since I am on bearmetal (not on AWS, GCP), I don't have load-balancer service. I need to use MetalLB in order to use LoadBalancer, otherwise, you will get pending if you don't specify the externalIPs field.

kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.13.11/config/manifests/metallb-native.yaml

Then create these two configs

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: metallb-pool
  namespace: metallb-system
spec:
  addresses:
  - 104.171.203.12-104.171.203.12

apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: metallb-l2ad
  namespace: metallb-system

You can use command kubectl get service -n ingress-nginx to see if ingress-nginx-controller (LoadBalancer) has its external ip assigned to 104.171.203.12. Make sure you read Concepts

But you don't have to use LoadBalancer, NodePort would expose your port to host machine as long as you specify nodePort in 30000 range. This video is helpful for setting up LoadBalancer. Your containerPort should match targetPort, and your port is what other pods can talk to this port internally in the cluster, and your nodePort is what enables outside traffics.

Setting up a ssh server inside k8s pod is another annoying task:

You need to make sure you generate ssh-keygen -A, otherwise ssh server cannot start and CMD ["/usr/sbin/sshd", "-De"] will fail. Also, you cannot generate these inside container because that will give you the same keys each time.
You need to allow root user login
You need add authorized_keys without bundling them into Dockerfile, you can do it with Secrets, but that will give you read-only filesystem with permission too broad and ssh-server reject due to this reason.
The permission is correct for every steps above

I have a private implementation at Here. I ended up with executing command using lifecycle.postStart.exec.command to append authorized_keys, and mount locally the generated keys into a different place using secrets then change the location of ssh_host_* to the mounted locations.

Certificates

In order to enable TLS/SSL, you need to add CA-signed certificates to k8s. We will use cert-manager to do that.

We first install cert-manager in cert-manager namespace.

kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.3/cert-manager.yaml
kubectl get pods --namespace cert-manager
# there should be cert-manager, cert-manager-cainjector, and cert-manager-webhook

WARNING: to delete cert-manager, please follow Guide carefully otherwise you might break your machine permanently.

Then we need to add ACME issuer (Automated Certificate Management Environment (ACME) Certificate Authority server). cert-manager offers two challenge validations - HTTP01 and DNS01 challenges.

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    # You must replace this email address with your own.
    # Let's Encrypt will use this to contact you about expiring
    # certificates, and issues related to your account.
    email: [email protected]
    disableAccountKeyGeneration: false # use to true if you don't want to generate key
    server: https://acme-staging-v02.api.letsencrypt.org/directory # you can change to https://acme-v02.api.letsencrypt.org/directory for deployment instead of staging (testing)
    privateKeySecretRef:
      # Secret resource that will be used to store the account's private key. You don't need to create this resource.
      name: example-issuer-account-key
    # Add a single challenge solver, HTTP01 using nginx
    solvers:
    - http01:
        ingress:
          ingressClassName: nginx

kubectl get clusterissuer

cert-manager uses your existing Ingress or Gateway configuration in order to solve HTTP01 challenges.

Then re-configure the ingress like the example. You should modify as described in the comment. Make sure all domain access will return an http response (not socket or anything else).

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    # add an annotation indicating the issuer to use.
    cert-manager.io/cluster-issuer: letsencrypt-staging
  name: myIngress
  namespace: myIngress
spec:
  rules:
  - host: ssl.kokecacao.me
    http:
      paths:
      - pathType: Prefix
        path: /
        backend:
          service:
            name: myservice
            port:
              number: 80
  tls: # < placing a host in the TLS config will determine what ends up in the cert's subjectAltNames
  - hosts:
    - ssl.kokecacao.me
    secretName: myingress-cert # < cert-manager will store the created certificate in this secret. You don't need to create this resource.

Then you should get myingress-cert shown when you execute the following commands

kubectl apply -f ingress.yaml
kubectl get certificates --all-namespaces
kubectl get secrets --all-namespaces

kubectl describe certificaterequest
kubectl describe order

When everything is complete change server to https://acme-v02.api.letsencrypt.org/directory because LetsEncrypt staging is for testing.

To switch, you may want to execute

kubectl delete certificates myingress-cert
kubectl delete secrets myingress-cert

If you use cloudflare, make sure to set SSL/TLS encryption mode to strict

Table of Content