Kubernetes - Cluster Monitoring
# Monitoring Solutions
## Heapster
Heapster is a container cluster monitoring and performance analysis tool with native support for Kubernetes and CoreOS.
Kubernetes also has a well-known monitoring agent, cAdvisor. It runs on every Kubernetes node and collects monitoring data for the host and its containers (CPU, memory, filesystem, network, uptime).
In recent versions, Kubernetes has integrated cAdvisor into the kubelet, so its metrics can be accessed directly on every node.
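Since the kubelet itself now serves these metrics, one quick way to look at them is through the API server's node proxy (a hedged example; it only assumes working `kubectl` access to the cluster):

```bash
# Pull cAdvisor metrics for the first node via the API server proxy endpoint
NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/metrics/cadvisor" | head
```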
## Weave Scope
Weave Scope can monitor the state and resource usage of a Kubernetes cluster, display the application topology, scale workloads, and even open a shell into a container directly from the browser for debugging. Its features include:
- Interactive topology view
- Graph and table modes
- Filtering
- Search
- Real-time metrics
- Container troubleshooting
- Plugin extensions
## Prometheus
Prometheus is an open-source monitoring and alerting toolkit built around a time-series database. It was originally developed at SoundCloud and became an independent open-source project as adoption grew; many companies and organizations have since adopted it for monitoring and alerting.
Today it is the most popular of these monitoring options.
# Monitoring Kubernetes with Prometheus
There are two ways to install Prometheus:
- Custom configuration
- kube-prometheus
Custom configuration is similar to installing Kubernetes from binaries: you set everything up from scratch.
kube-prometheus is more like kubeadm: it gets you up and running quickly.
Operators who want to understand every Prometheus component can use the custom configuration; if you just need something running fast, use kube-prometheus.
## Custom Configuration
### Create the ConfigMap
```yaml
# Create prometheus-config.yml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
        - targets: ['localhost:9090']
```
```bash
# Create the ConfigMap
kubectl create -f prometheus-config.yml
```
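As a quick sanity check (not part of the original steps), you can read the ConfigMap back:

```bash
# Confirm the configuration was stored as expected
kubectl get configmap prometheus-config -o yaml
```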
### Deploy Prometheus
```yaml
# Create prometheus-deploy.yml
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  labels:
    name: prometheus
spec:
  ports:
    - name: prometheus
      protocol: TCP
      port: 9090
      targetPort: 9090
  selector:
    app: prometheus
  type: NodePort
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    name: prometheus
  name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.2.1
          command:
            - "/bin/prometheus"
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
          ports:
            - containerPort: 9090
              protocol: TCP
          volumeMounts:
            - mountPath: "/etc/prometheus"
              name: prometheus-config
      volumes:
        - name: prometheus-config
          configMap:
            name: prometheus-config
```
```bash
# Create the Deployment and Service
kubectl create -f prometheus-deploy.yml
# Check that the pod is running
kubectl get pods -l app=prometheus
# Get the Service information
kubectl get svc -l name=prometheus
# Access the UI at http://<node-ip>:<node-port>
```
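If you would rather not read the NodePort out of the `kubectl get svc` output by hand, a small helper like this prints the URL directly (an optional convenience based on the Service defined above):

```bash
# Build the externally reachable Prometheus URL from the NodePort Service
NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
NODE_PORT=$(kubectl get svc prometheus -o jsonpath='{.spec.ports[0].nodePort}')
echo "http://${NODE_IP}:${NODE_PORT}"
```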
### Configure Access Permissions
```yaml
# Create prometheus-rbac-setup.yml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"]
  - apiGroups:
      - extensions
    resources:
      - ingresses
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: default
```

```bash
# Create the RBAC resources
kubectl create -f prometheus-rbac-setup.yml
```

Then modify prometheus-deploy.yml so the Deployment runs under the new ServiceAccount:

```yaml
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      serviceAccount: prometheus
```
```bash
# Roll out the updated Deployment
kubectl apply -f prometheus-deploy.yml
# Check the pod
kubectl get pods -l app=prometheus
# Verify that the ServiceAccount token has been mounted
kubectl exec -it <pod name> -- ls /var/run/secrets/kubernetes.io/serviceaccount/
```
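To double-check that the ClusterRoleBinding actually grants what Prometheus needs, `kubectl auth can-i` can impersonate the ServiceAccount (an optional check, not in the original steps):

```bash
# Both commands should print "yes" if the RBAC objects above are effective
kubectl auth can-i list nodes --as=system:serviceaccount:default:prometheus
kubectl auth can-i watch pods --as=system:serviceaccount:default:prometheus
```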
### Service Discovery Configuration
```yaml
# Add discovery jobs so Prometheus can find cluster resources; update prometheus-config.yml as follows
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
        - targets: ['localhost:9090']
      - job_name: 'kubernetes-nodes'
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
          - role: node
      - job_name: 'kubernetes-service'
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
          - role: service
      - job_name: 'kubernetes-endpoints'
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
          - role: endpoints
      - job_name: 'kubernetes-ingress'
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
          - role: ingress
      - job_name: 'kubernetes-pods'
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
          - role: pod
```
```bash
# Apply the updated configuration
kubectl apply -f prometheus-config.yml
# Find the Prometheus pod
kubectl get pods -l app=prometheus
# Delete the pod so it restarts with the new configuration
kubectl delete pods <pod name>
# Watch the new pod come up
kubectl get pods
# Reload the Prometheus UI in the browser
```
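Once the new pod is running, the discovery jobs should show up on the Targets page. The same information is available from the Prometheus HTTP API if you prefer the command line (optional; `<node-ip>:<node-port>` is the NodePort address used earlier):

```bash
# List the discovered scrape targets and their health
curl -s http://<node-ip>:<node-port>/api/v1/targets
```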
### Synchronize the System Time
If the node clocks are skewed, recently scraped data can appear to be missing or misaligned when you query "now" in the UI, so make sure time is synchronized.
```bash
# Check the system time
date
# Synchronize with an NTP server
ntpdate cn.pool.ntp.org
```
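On hosts where `ntpdate` is not installed, chrony is a common alternative (shown only as a hedged alternative, assuming chronyd is running; it is not part of the original steps):

```bash
# Force an immediate clock correction via chrony
chronyc makestep
```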
### Monitoring the Kubernetes Cluster
#### Collect container metrics from the kubelet
```yaml
# Add the following job to the configuration, then apply the update and restart the pod
- job_name: 'kubernetes-cadvisor'
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
    - role: node
  relabel_configs:
    - target_label: __address__
      replacement: kubernetes.default.svc:443
    - source_labels: [__meta_kubernetes_node_name]
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
```
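After this job is applied, cAdvisor series such as `container_cpu_usage_seconds_total` become queryable. A quick shell check against the query API (optional; note that the pod label name can differ between kubelet versions, e.g. `pod` vs `pod_name`):

```bash
# Per-pod CPU usage rate over the last 5 minutes, from cAdvisor metrics
curl -s 'http://<node-ip>:<node-port>/api/v1/query' \
  --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total{image!=""}[5m])) by (pod)'
```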
#### Monitor node resource usage with Node Exporter
```yaml
# Create node-exporter-daemonset.yml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  labels:
    app: node-exporter   # added so the labeled "kubectl get daemonsets" below can find it
spec:
  selector:              # required by apps/v1
    matchLabels:
      app: node-exporter
  template:
    metadata:
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '9100'
        prometheus.io/path: 'metrics'
      labels:
        app: node-exporter
      name: node-exporter
    spec:
      containers:
        - image: prom/node-exporter
          imagePullPolicy: IfNotPresent
          name: node-exporter
          ports:
            - containerPort: 9100
              hostPort: 9100
              name: scrape
      hostNetwork: true
      hostPID: true
```
```bash
# Create the DaemonSet
kubectl create -f node-exporter-daemonset.yml
# Check the DaemonSet status
kubectl get daemonsets -l app=node-exporter
# Check the pods
kubectl get pods -l app=node-exporter
```
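Because the DaemonSet runs with `hostNetwork` and `hostPort: 9100`, each node exposes the exporter directly, which makes a spot check easy (optional):

```bash
# Fetch a few raw metrics straight from one node's node-exporter
curl -s http://<node-ip>:9100/metrics | head
```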
```yaml
# Add a scrape job for annotated pods (note: job names must be unique,
# so this should replace the simpler 'kubernetes-pods' job defined earlier)
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__
    - action: labelmap
      regex: __meta_kubernetes_pod_label_(.+)
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: kubernetes_namespace
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: kubernetes_pod_name
```
```yaml
# Monitor the API server, which handles all incoming cluster requests
- job_name: 'kubernetes-apiservers'
  kubernetes_sd_configs:
    - role: endpoints
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
    - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
      action: keep
      regex: default;kubernetes;https
    - target_label: __address__
      replacement: kubernetes.default.svc:443
```
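With these jobs in place, a single query for the `up` series per job shows whether targets are being scraped successfully (optional; same query API as before):

```bash
# Number of healthy targets per job (each successfully scraped target contributes 1)
curl -s 'http://<node-ip>:<node-port>/api/v1/query' --data-urlencode 'query=sum(up) by (job)'
```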
#### Probe Services and Ingresses over the network
```yaml
# Create blackbox-exporter.yaml for network probing
apiVersion: v1
kind: Service
metadata:
  labels:
    app: blackbox-exporter
  name: blackbox-exporter
spec:
  ports:
    - name: blackbox
      port: 9115
      protocol: TCP
  selector:
    app: blackbox-exporter
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: blackbox-exporter
  name: blackbox-exporter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: blackbox-exporter
  template:
    metadata:
      labels:
        app: blackbox-exporter
    spec:
      containers:
        - image: prom/blackbox-exporter
          imagePullPolicy: IfNotPresent
          name: blackbox-exporter
```
```bash
# Create the resources
kubectl create -f blackbox-exporter.yaml
```
```yaml
# Add scrape jobs that probe every annotated Service and Ingress
- job_name: 'kubernetes-services'
  metrics_path: /probe
  params:
    module: [http_2xx]
  kubernetes_sd_configs:
    - role: service
  relabel_configs:
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
      action: keep
      regex: true
    - source_labels: [__address__]
      target_label: __param_target
    - target_label: __address__
      replacement: blackbox-exporter.default.svc.cluster.local:9115
    - source_labels: [__param_target]
      target_label: instance
    - action: labelmap
      regex: __meta_kubernetes_service_label_(.+)
    - source_labels: [__meta_kubernetes_namespace]
      target_label: kubernetes_namespace
    - source_labels: [__meta_kubernetes_service_name]
      target_label: kubernetes_name
- job_name: 'kubernetes-ingresses'
  metrics_path: /probe
  params:
    module: [http_2xx]
  kubernetes_sd_configs:
    - role: ingress
  relabel_configs:
    - source_labels: [__meta_kubernetes_ingress_annotation_prometheus_io_probe]
      action: keep
      regex: true
    - source_labels: [__meta_kubernetes_ingress_scheme,__address__,__meta_kubernetes_ingress_path]
      regex: (.+);(.+);(.+)
      replacement: ${1}://${2}${3}
      target_label: __param_target
    - target_label: __address__
      replacement: blackbox-exporter.default.svc.cluster.local:9115
    - source_labels: [__param_target]
      target_label: instance
    - action: labelmap
      regex: __meta_kubernetes_ingress_label_(.+)
    - source_labels: [__meta_kubernetes_namespace]
      target_label: kubernetes_namespace
    - source_labels: [__meta_kubernetes_ingress_name]
      target_label: kubernetes_name
```
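Both jobs only keep targets that carry a `prometheus.io/probe: "true"` annotation, so a Service (or Ingress) has to opt in; for example (illustrative, with a hypothetical Service named `my-app`):

```bash
# Opt an existing Service in to blackbox probing
kubectl annotate service my-app prometheus.io/probe=true
```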
### Grafana Visualization
A few basic concepts first:
#### Data Source
For Grafana, anything that supplies it with data, Prometheus included, is a Data Source. Grafana currently ships official support for Graphite, InfluxDB, OpenTSDB, Prometheus, Elasticsearch, and CloudWatch. An administrator only needs to register these backends as data sources, and Grafana can then visualize their data with little effort.
#### Dashboard
Once data sources define where the data comes from, the most important task for users is visualizing it. In Grafana, dashboards are how visualization panels are organized and managed. Official dashboard templates are available for import.
#### Organizations and Users
As a general-purpose visualization tool, Grafana also offers organization-level management for enterprises in addition to flexible visualization. Every dashboard belongs to an Organization, which allows Grafana to be used at larger scale: a company can create multiple organizations, a User can belong to one or more of them, and a user can be granted different permissions in each, so the management model can mirror the company's structure.
#### Integrating Grafana
Deploy Grafana:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana-core
  namespace: kube-system
  labels:
    app: grafana
    component: core
spec:
  selector:
    matchLabels:
      app: grafana
  replicas: 1
  template:
    metadata:
      labels:
        app: grafana
        component: core
    spec:
      containers:
        - image: grafana/grafana:6.5.3
          name: grafana-core
          imagePullPolicy: IfNotPresent
          env:
            # The following env variables set up basic auth with the default admin user and admin password.
            - name: GF_AUTH_BASIC_ENABLED
              value: "true"
            - name: GF_AUTH_ANONYMOUS_ENABLED
              value: "false"
            # - name: GF_AUTH_ANONYMOUS_ORG_ROLE
            #   value: Admin
            # does not really work, because of template variables in exported dashboards:
            # - name: GF_DASHBOARDS_JSON_ENABLED
            #   value: "true"
          readinessProbe:
            httpGet:
              path: /login
              port: 3000
            # initialDelaySeconds: 30
            # timeoutSeconds: 1
          volumeMounts:
            - name: grafana-persistent-storage
              mountPath: /var
      volumes:
        - name: grafana-persistent-storage
          hostPath:
            path: /data/devops/grafana
            type: Directory   # this directory must already exist on the node
```
Create a Service to expose Grafana on a NodePort:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: kube-system
  labels:
    app: grafana
    component: core
spec:
  type: NodePort
  ports:
    - port: 3000
      nodePort: 30011
  selector:
    app: grafana
    component: core
```
#### Configure Grafana Panels
Add Prometheus as a data source.
Download a Kubernetes dashboard from the official dashboard library and import it.
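Note that this Grafana runs in `kube-system` while the Prometheus Service created earlier lives in `default`, so the data source URL should be the cross-namespace DNS name, e.g. `http://prometheus.default.svc.cluster.local:9090`. A throwaway pod can confirm the endpoint is reachable (optional check; the curl image is just an example):

```bash
# Verify the Prometheus Service responds from inside the cluster
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://prometheus.default.svc.cluster.local:9090/-/healthy
```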
## kube-prometheus
kube-prometheus is a complete monitoring solution: it uses Prometheus to collect cluster metrics and Grafana for dashboards, and it ships with a large set of ready-made templates.
The Prometheus Operator introduces custom resources into Kubernetes that declare the desired state of Prometheus and Alertmanager clusters as well as the Prometheus configuration:
- Prometheus
- ServiceMonitor
- PodMonitor
The Prometheus resource declaratively describes the desired state of a Prometheus deployment, while the ServiceMonitor and PodMonitor resources describe the targets Prometheus should monitor.
### Download
Download the project source from GitHub (the examples below use release 0.11.0).
### Replace Images with Domestic Mirrors
Some of the images cannot be pulled from inside China; point the YAML files at mirror registries instead.
```bash
cd kube-prometheus-0.11.0/manifests/
sed -i 's/quay.io/quay.mirrors.ustc.edu.cn/g' prometheusOperator-deployment.yaml
sed -i 's/quay.io/quay.mirrors.ustc.edu.cn/g' prometheus-prometheus.yaml
sed -i 's/quay.io/quay.mirrors.ustc.edu.cn/g' alertmanager-alertmanager.yaml
sed -i 's/quay.io/quay.mirrors.ustc.edu.cn/g' kubeStateMetrics-deployment.yaml
sed -i 's/k8s.gcr.io/lank8s.cn/g' kubeStateMetrics-deployment.yaml
sed -i 's/quay.io/quay.mirrors.ustc.edu.cn/g' nodeExporter-daemonset.yaml
sed -i 's/quay.io/quay.mirrors.ustc.edu.cn/g' prometheusAdapter-deployment.yaml
sed -i 's/k8s.gcr.io/lank8s.cn/g' prometheusAdapter-deployment.yaml
# Check whether any foreign registries are still referenced
grep "image: " * -r
```
### Install
```bash
# First install the prerequisites in manifests/setup
kubectl apply --server-side -f manifests/setup
# Optionally wait for the CRDs to be established
kubectl wait \
  --for condition=Established \
  --all CustomResourceDefinition \
  --namespace=monitoring
# Then install the remaining resources
kubectl apply -f manifests/
```
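Everything comes up in the `monitoring` namespace; it can take a few minutes for all images to pull (optional verification):

```bash
# All pods should eventually reach Running
kubectl get pods -n monitoring
# Services created by the stack
kubectl get svc -n monitoring
```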
### Add NodePort Services
To reach Prometheus, Alertmanager, and Grafana from outside the cluster, change their Services to type NodePort:

```bash
vim prometheus-service.yaml
vim alertmanager-service.yaml
vim grafana-service.yaml
```
```yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    prometheus: k8s
  name: prometheus-k8s
  namespace: monitoring
spec:
  type: NodePort # added
  ports:
    - name: web
      port: 9090
      targetPort: web
      nodePort: 39090 # added (note: outside the default 30000-32767 NodePort range unless --service-node-port-range has been widened)
  selector:
    app: prometheus
    prometheus: k8s
  sessionAffinity: ClientIP
```
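After editing the three files, re-apply them so the type change takes effect (a follow-up step the original text leaves implicit):

```bash
kubectl apply -f prometheus-service.yaml -f alertmanager-service.yaml -f grafana-service.yaml
kubectl get svc -n monitoring | grep NodePort
```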
### Configure Ingress
This setup uses ingress-nginx, so the NetworkPolicies also need their ingress rules extended to allow traffic from the ingress-nginx namespace.
Modify grafana-networkPolicy.yaml to allow access from the ingress-nginx namespace:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  labels:
    app.kubernetes.io/component: grafana
    app.kubernetes.io/name: grafana
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 9.3.2
  name: grafana
  namespace: monitoring
spec:
  egress:
    - {}
  ingress:
    - from:
        - namespaceSelector:   # added: allow traffic from the ingress-nginx namespace
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus
      ports:
        - port: 3000
          protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/component: grafana
      app.kubernetes.io/name: grafana
      app.kubernetes.io/part-of: kube-prometheus
  policyTypes:
    - Egress
    - Ingress
```
Modify alertmanager-networkPolicy.yaml and prometheus-networkPolicy.yaml in the same way to allow access from the ingress-nginx namespace.
For local testing you can simulate the domain names with hosts entries (C:\Windows\System32\drivers\etc\hosts on Windows).
Note: use the IP of the node where ingress-nginx is installed, not the master node.
```
# Access via domain name (add hosts entries if you have no real DNS records)
192.168.199.29 grafana.youngkbt.cn
192.168.199.29 prometheus.youngkbt.cn
192.168.199.29 alertmanager.youngkbt.cn
```
Create prometheus-ingress.yaml:
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  namespace: monitoring
  name: prometheus-ingress
spec:
  ingressClassName: nginx
  rules:
    - host: grafana.youngkbt.cn # Grafana domain
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: grafana
                port:
                  number: 3000
    - host: prometheus.youngkbt.cn # Prometheus domain
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-k8s
                port:
                  number: 9090
    - host: alertmanager.youngkbt.cn # Alertmanager domain
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: alertmanager-main
                port:
                  number: 9093
```
Create the Ingress:

```bash
kubectl apply -f prometheus-ingress.yaml
```
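Checking the Ingress object and sending a request with the matching Host header is a quick way to confirm routing works (optional; replace the IP with the node running ingress-nginx):

```bash
kubectl get ingress -n monitoring
curl -H "Host: prometheus.youngkbt.cn" http://<ingress-nginx-node-ip>/
```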
### Uninstall

```bash
kubectl delete --ignore-not-found=true -f manifests/ -f manifests/setup
```