Kubernetes资源调度

1. 深入理解Pod对象：调度

创建一个Pod的工作流程
Pod中影响调度的主要属性
资源限制对Pod调度的英雄
nodeSelector & nodeAffinity
Tain（污点）& Tolerations（污点容忍）
nodeName

Kubernetes Scheduler 是 Kubernetes 控制平面的核心组件之一。它在控制平面上运行，将 Pod 分配给节点，同时平衡节点之间的资源利用率。将 Pod 分配给新节点后，在该节点上运行的 kubelet 会在 Kubernetes API 中检索 Pod 定义，根据节点上的 Pod 规范创建资源和容器。换句话说，Scheduler 在控制平面内运行，并将工作负载分配给 Kubernetes 集群。

2. 创建一个Pod的工作流程

Kubernetes基于list-watch机制的控制器架构，实现组件间交互的解耦。

其他组件监控自己负责的资源，当这些资源发生变化时，kube-apiserver会通知这些组件，这个过程类似于发布与订阅

3. nodeSelector（节点选择器）

nodeSelector : 用于将Pod调度到匹配Label的Node上，如果没有匹配的标签会调度失败。
作用：

约束Pod到特定的节点运行
完全匹配节点标签

nodeSelector 是节点选择约束的最简单推荐形式。nodeSelector 是 PodSpec 的一个字段。它包含键值对的映射。为了使 pod 可以在某个节点上运行，该节点的标签中必须包含这里的每个键值对（它也可以具有其他标签）。最常见的用法的是一对键值对。

执行 kubectl get nodes 命令获取集群的节点名称。选择一个你要增加标签的节点，然后执行 kubectl label nodes = 命令将标签添加到你所选择的节点上。例如，如果你的节点名称为 ‘kubernetes-foo-node-1.c.a-robinson.internal’ 并且想要的标签是 ‘disktype=ssd’，则可以执行 kubectl label nodes kubernetes-foo-node-1.c.a-robinson.internal disktype=ssd 命令。

你可以通过重新运行 kubectl get nodes --show-labels，查看节点当前具有了所指定的标签来验证它是否有效。你也可以使用 kubectl describe node “nodename” 命令查看指定节点的标签完整列表
添加 nodeSelector 字段到 Pod 配置中
添加一个 nodeSelector 部分

//查看 node 的 label
[root@master ~]# kubectl get nodes --show-labels
NAME                STATUS   ROLES                  AGE     VERSION   LABELS
master              Ready    control-plane,master   5d10h   v1.23.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=master,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/exclude-from-external-load-balancers=
node1.example.com   Ready    <none>                 5d10h   v1.23.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node1.example.com,kubernetes.io/os=linux
node2.example.com   Ready    <none>                 5d10h   v1.23.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node2.example.com,kubernetes.io/os=linux

//节点添加标签
[root@master ~]# kubectl label nodes node2.example.com disktype=ssd

//写入disktype=ssd
[root@master manifest]# vim test.yaml 
---
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - name: b1
    image: busybox
    command: ["/bin/sh","-c","sleep 9000"]
    env:
    - name: xx
      valueFrom:
        fieldRef:
          fieldPath: status.podIPs
  nodeSelector:
    disktype: ssd

[root@master manifest]# kubectl create -f test.yaml 
pod/test created
[root@master manifest]# kubectl get pods
NAME   READY   STATUS    RESTARTS   AGE
test   1/1     Running   0          8s
//标签选择到指定主机上面，不会变
[root@master manifest]# kubectl get pods -o wide
NAME   READY   STATUS    RESTARTS   AGE   IP            NODE                NOMINATED NODE   READINESS GATES
test   1/1     Running   0          58s   10.197.3.79  node2.example.com   <none>           <none>

4. 节点亲和性

nodeAffinity：
节点亲和性，与 nodeSelector作用一样，但相比更灵活，满足更多条件，诸如：

匹配右更多的逻辑组合，不只是字符串的完全相等
调度分为软策略和硬策略，而不是硬性要求
- 硬（required）：必须满足
- 软（preferred）：尝试满足，但不保证
  操作符：ln、Notln、Exists、DoesNotExist、Gt、Lt

操作符	作用
In	label 的值在某个列表中
NotIn	label 的值不在某个列表中
Gt	label 的值大于某个值
Lt	label 的值小于某个值
Exists	某个 label 存在
DoesNotExist	某个 label 不存在

nodeAffinity 相对应的是 Anti-Affinity，就是反亲和性，这种方法比上面的 nodeSelector更加灵活，它可以进行一些简单的逻辑组合了，不只是简单的相等匹配。调度可以分成软策略和硬策略两种方式。

软策略就是如果你没有满足调度要求的节点的话，Pod 就会忽略这条规则，继续完成调度过程，说白了就是满足条件最好了，没有的话也无所谓了的策。

硬策略就比较强硬了，如果没有满足条件的节点的话，就不断重试直到满足条件为止，简单说就是你必须满足我的要求，不然我就不干的策略。

nodeAffinity就有两上面两种策略

preferredDuringSchedulingIgnoredDuringExecution：软策略
requiredDuringSchedulingIgnoredDuringExecution：硬策略

[root@master ~]# cat test2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
  labels:
    app: node-affinity-pod
spec:
  containers:
  - name: with-node-affinity
    image: nginx
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: NotIn
            values:
            - node1
            - node2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: source
            operator: In
            values:
            - qikqiak

[root@master ~]# kubectl apply -f test2.yaml 
pod/with-node-affinity created
[root@master ~]# kubectl get pods
NAME                 READY   STATUS    RESTARTS   AGE
test-busybox         1/1     Running   0          15m
with-node-affinity   1/1     Running   0          59s

5. Taint（污点）& Tolerations（污点容忍）

节点亲和性是 Pod 的一种属性，它使 Pod 被吸引到一类特定的节点（这可能出于一种偏好，也可能是硬性要求）。污点（Taint）则相反——它使节点能够排斥一类特定的 Pod。

容忍度（Toleration）是应用于 Pod 上的，允许（但并不要求）Pod 调度到带有与之匹配的污点的节点上。

污点和容忍度（Toleration）相互配合，可以用来避免 Pod 被分配到不合适的节点上。每个节点上都可以应用一个或多个污点，这表示对于那些不能容忍这些污点的 Pod，是不会被该节点接受的。

概念

您可以使用命令 kubectl taint 给节点增加一个污点。比如，

kubectl taint nodes node1 key1=value1:NoSchedule

给节点 node1 增加一个污点，它的键名是 key1，键值是 value1，效果是 NoSchedule。这表示只有拥有和这个污点相匹配的容忍度的 Pod 才能够被分配到 node1 这个节点。

若要移除上述命令所添加的污点，你可以执行：

kubectl taint nodes node1 key1=value1:NoSchedule-

您可以在 PodSpec 中定义 Pod 的容忍度。下面两个容忍度均与上面例子中使用 kubectl taint 命令创建的污点相匹配，因此如果一个 Pod 拥有其中的任何一个容忍度都能够被分配到 node1 ：

tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
  tolerations:
- key: "key1"
  operator: "Exists"
  effect: "NoSchedule"

例子

[root@master ~]# kubectl describe node master
CreationTimestamp:  Sat, 18 Dec 2021 11:50:26 +0800
Taints:             node-role.kubernetes.io/master:NoSchedule	// 默认为不可调度
Unschedulable:      false
Lease:
  HolderIdentity:  master

// 删除就是在添加污点的后面加一个减号"-"
[root@master ~]# kubectl taint node master node-role.kubernetes.io/master-
node/master untainted

[root@master ~]# kubectl describe node master
CreationTimestamp:  Sat, 18 Dec 2021 11:50:26 +0800
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  master

//添加污点
 删除就是在添加污点的后面加一个减号"-"
[root@master ~]# kubectl taint node master node-role.kubernetes.io/master-
node/master untainted

[root@master ~]# kubectl describe node master
CreationTimestamp:  Sat, 18 Dec 2021 11:50:26 +0800
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  master

//查看各节点的污点
[root@master ~]# kubectl get nodes
NAME                STATUS   ROLES                  AGE     VERSION
master              Ready    control-plane,master   5d11h   v1.23.1
node1.example.com   Ready    <none>                 5d11h   v1.23.1
node2.example.com   Ready    <none>                 5d11h   v1.23.1

[root@master ~]# kubectl describe node  master | grep Taints:
Taints:             node-role.kubernetes.io/master:NoSchedule

[root@master ~]# kubectl describe node  node1 | grep Taints: 
Taints:             <none>


// 添加污点
[root@master ~]# kubectl taint node node1.example.com env_wu=yes:NoSchedule
node/node1.example.com tainted
[root@master ~]# kubectl describe node node1 | grep Taints:Taints:             env_wu=yes:NoSchedule

# 格式：
# key=value:值
# k/v都可以自定义

添加NoSchedule污点后，节该节点不会调度pod运行；如果该节点已经运行pod，也不会删除

// 测试一下
创建一个test容器，复制成两个
[root@master ~]# kubectl create deployment test --image=nginx
deployment.apps/test created

[root@master ~]#kubectl create deployment test--replicas=2
deployment.apps/test created

// 全部由node2运行
[root@master ~]# kubectl get pod -o wide
NAME                   READY   STATUS              RESTARTS      AGE   IP            NODE                NOMINATED NODE   READINESS GATES
test-8499f4f74-nnkn4   1/1     Running             0             38s   10.244.2.72   node2.example.com   <none>           <none>
test-8499f4f74-r2w9l   1/1     Running   		   0           

// 删除污点，重新扩展Pod
[root@master ~]# kubectl taint node node1.example.com env_wu=yes:NoSchedule-
node/node1.example.com untainted
[root@master ~]# kubectl describe node node1 | grep Taints:Taints:             <none>

[root@master ~]# node/node1.example.com untainted

CreationTimestamp:  Sat, 18 Dec 2021 12:00:45 +0800
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  node1.example.com

Tolerations（污点容忍）

// 添加污点/NoExecute （驱逐）
[root@master ~]# kubectl taint node node01 env_wu=yes:NoExecute
node/node1.example.com tainted

[root@master ~]# vi test.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-ton
spec:
  selector:
    matchLabels:
      app: test-ton
  replicas: 3
  template:
    metadata:
      labels:
        app: test-ton
    spec:
      containers:
      - name: nginx
        image: nginx

// 没有node1节点Pod
[root@master ~]# kubectl get pod -o wide  //node01无pod
NAME                        READY   STATUS              RESTARTS        AGE   IP            NODE     			  NOMINATED NODE   READINESS GATES
test-ton-6798d496c7-2rfnf   1/1     Running  			  0               25s    <none>        node2.example.com   <none>           <none>
test-ton-6798d496c7-2x6j9   1/1     Running  			  0               25s    <none>        node2.example.com   <none>           <none>
test-ton-6798d496c7-r66wk   1/1     Running   		  	  0               25s    <none>        node2.example.com   <none>           <none>

// 添加容忍
// 编写yaml文件
[root@master ~]# vi test2.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-ton
spec:
  selector:
    matchLabels:
      app: test-ton
  replicas: 3
  template:
    metadata:
      labels:
        app: test-ton
    spec:
      containers:
      - name: nginx
        image: nginx
      tolerations:        		// 容忍
      - key: "env_wu"       	// 设置污点的key
        operator: "Equal"
        value: "yes"        	// 设置污点的values
        effect: "NoExecute" 	// 设置污点类型
        
// 应用yaml文件
[root@master ~]# kubectl apply -f ss.yaml
Warning: resource deployments/test-ton is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by kubectl apply. kubectl apply should only be used on resources created declaratively by either kubectl create --save-config or kubectl apply. The missing annotation will be patched automatically.	// 警告可以不用管
deployment.apps/test-taint configured

[root@master ~]# kubectl get pod -o wide
NAME                        READY   STATUS              RESTARTS      AGE     IP            NODE     			NOMINATED NODE   READINESS GATES
test-ton-6798d496c7-2x6j9   1/1     Running             0             5m32s   10.244.2.64   node2.example.com   <none>           <none>
test-ton-749fbcf99f-jlf42   1/1     Running             0             40s     10.244.1.62   node1.example.com   <none>           <none>
test-ton-749fbcf99f-rtrbr   1/1     Running             0             21s     10.244.2.67   node2.example.com   <none>           <none>
test-ton-749fbcf99f-v86zc   0/1     ContainerCreating   0             3s      <none>        node1.example.com   <none>           <none>

6. nodename

nodeName是最简单的节点选择约束形式，但由于其局限性，通常不使用它。是 PodSpec 的一个域。如果它是非空的，调度程序将忽略 pod，并且在命名节点上运行的 kubelet 会尝试运行该 pod。因此，如果在 PodSpec 中提供，则它优先于上述节点选择方法。nodeName nodeName

使用来选择节点的一些限制是：nodeName

如果命名节点不存在，则不会运行 Pod，并且在某些情况下可能会被自动删除。
如果命名节点没有容纳 pod 的资源，则 Pod 将失败，其原因将指示原因，例如 OutOfmemory 或 OutOfcpu。
云环境中的节点名称并不总是可预测的或稳定的。
下面是使用该字段的 pod 配置文件的示例：nodeName

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeName: kube-01