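Alibaba Cloud OSS and NAS need regular inspection, with traffic and storage size monitored continuously: OSS to catch abuse, NAS to catch bottlenecks. NAS is what I chose for storing the logs of the applications running on k8s. Since all of our applications now have to be deployed to k8s, solid monitoring is a must.
The first step is to get familiar with the relevant Alibaba Cloud SDK APIs, which provide the real-time data. Prometheus then collects that data automatically: all that is needed is an exporter for OSS and NAS, and Prometheus pulls the monitoring data straight from its metrics endpoint.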
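To make the exporter idea concrete, here is a minimal sketch of the part that turns collected values into what Prometheus scrapes. It uses only the standard library and is not the code from this project. The metric names, the label names, and the `collect_samples` stub are assumptions for illustration; in a real exporter the stub would be replaced by Alibaba Cloud SDK calls, and the `prometheus_client` library would probably do the rendering.

```python
from typing import Dict, Iterable, List, Tuple

# (metric name, help text, labels, value). All names here are hypothetical.
Sample = Tuple[str, str, Dict[str, str], float]


def escape_label(value: str) -> str:
    """Escape backslash, double quote and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(samples: Iterable[Sample]) -> str:
    """Render gauge samples in the Prometheus text exposition format."""
    # Group by metric name so each family is emitted as one contiguous block.
    families: Dict[str, Tuple[str, List[Tuple[Dict[str, str], float]]]] = {}
    for name, help_text, labels, value in samples:
        families.setdefault(name, (help_text, []))[1].append((labels, value))

    lines: List[str] = []
    for name, (help_text, rows) in families.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        for labels, value in rows:
            pairs = ",".join(
                f'{key}="{escape_label(val)}"' for key, val in sorted(labels.items())
            )
            label_part = "{" + pairs + "}" if pairs else ""
            lines.append(f"{name}{label_part} {value}")
    return "\n".join(lines) + "\n"


def collect_samples() -> List[Sample]:
    """Placeholder: hardcoded values standing in for real SDK queries."""
    return [
        ("aliyun_nas_used_bytes", "NAS used capacity in bytes",
         {"filesystem": "fs-demo"}, 1024.0),
        ("aliyun_oss_storage_bytes", "OSS bucket storage size in bytes",
         {"bucket": "bucket-demo"}, 2048.0),
    ]
```

Calling `render_metrics(collect_samples())` returns the text body that would be served on the metrics path.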
I package the code as an image and deploy it to k8s. The Dockerfile:
# Use Python 3.10 to match the project runtime
ARG TARGETPLATFORM=linux/amd64
FROM --platform=$TARGETPLATFORM python:3.10-slim
# Ensure reliable Python runtime in containers
ARG http_proxy
ARG https_proxy
ARG ALL_PROXY
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    SERVER_HOST=0.0.0.0 \
    SERVER_PORT=8000 \
    METRICS_PATH=/metrics \
    http_proxy=${http_proxy} \
    https_proxy=${https_proxy} \
    ALL_PROXY=${ALL_PROXY} \
    HTTP_PROXY=${http_proxy} \
    HTTPS_PROXY=${https_proxy}
# App workdir
WORKDIR /app
# Install Python dependencies first (better layer caching)
COPY requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r /app/requirements.txt
# Copy only runtime source code (exclude docs, tests, configs)
COPY main.py /app/main.py
COPY aliyun_client.py /app/aliyun_client.py
COPY prometheus_metrics.py /app/prometheus_metrics.py
COPY config.py /app/config.py
# Prometheus will scrape this port
EXPOSE 8000
# Default command: start exporter
CMD ["python", "main.py"]
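The Dockerfile sets `SERVER_HOST`, `SERVER_PORT` and `METRICS_PATH` as environment variables, so the exporter presumably reads them at startup. The following is a sketch of how a `config.py` could do that. The `ServerConfig` class and the `load_config` function are hypothetical names, not taken from the project; only the variable names and their defaults come from the Dockerfile.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ServerConfig:
    host: str
    port: int
    metrics_path: str


def load_config(env: Mapping[str, str] = os.environ) -> ServerConfig:
    """Build the server config from environment variables, with the Dockerfile defaults."""
    port = int(env.get("SERVER_PORT", "8000"))
    if not 1 <= port <= 65535:
        raise ValueError(f"SERVER_PORT out of range: {port}")

    path = env.get("METRICS_PATH", "/metrics")
    if not path.startswith("/"):
        # Tolerate a value such as "metrics" by normalizing it to "/metrics".
        path = "/" + path

    return ServerConfig(
        host=env.get("SERVER_HOST", "0.0.0.0"),
        port=port,
        metrics_path=path,
    )
```

Taking the environment as a parameter keeps the function easy to test, and in k8s the same variables can be overridden from the ConfigMap without rebuilding the image.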
The k8s deployment YAML:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nas-exporter
  namespace: monitoring
  labels:
    app: nas-exporter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nas-exporter
  template:
    metadata:
      labels:
        app: nas-exporter
    spec:
      containers:
        - name: nas-exporter
          image: aliyun-nas-exporter:x86
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 8000
          envFrom:
            - configMapRef:
                name: nas-exporter-config
            - secretRef:
                name: nas-exporter-secrets
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 15
            periodSeconds: 20
            timeoutSeconds: 3
          resources:
            requests:
              cpu: 50m
              memory: 128Mi
            limits:
              cpu: 1
              memory: 512Mi
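Both probes in the Deployment call `/health` on port 8000, so the exporter has to answer that path in addition to the metrics path. Here is a rough standard-library sketch of such a server. The `route`, `ExporterHandler` and `serve` names are hypothetical, and the metrics body is a placeholder rather than real data.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Tuple


def route(path: str, metrics_path: str = "/metrics") -> Tuple[int, str, bytes]:
    """Return (status code, content type, body) for a request path."""
    path = path.split("?", 1)[0]  # ignore any query string
    if path == "/health":
        return 200, "text/plain; charset=utf-8", b"ok\n"
    if path == metrics_path:
        # Placeholder body; a real exporter would render its samples here.
        body = b"# metrics would be rendered here\n"
        return 200, "text/plain; version=0.0.4; charset=utf-8", body
    return 404, "text/plain; charset=utf-8", b"not found\n"


class ExporterHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status, content_type, body = route(self.path)
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args) -> None:
        # Silence per-request logging so probe traffic does not flood the logs.
        pass


def serve(host: str = "0.0.0.0", port: int = 8000) -> None:
    """Start the HTTP server and block until the process is stopped."""
    ThreadingHTTPServer((host, port), ExporterHandler).serve_forever()
```

Keeping `/health` cheap and independent of the cloud API is a deliberate choice in this sketch: a slow or failing SDK call should show up in the metrics, not restart the pod through the liveness probe.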
Then configure Prometheus to scrape the exporter. The result looks like this:


Webhook alerts:
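The post does not show how the webhook side is built. As a sketch, assuming the alerts arrive through Alertmanager's webhook receiver, the JSON body carries an `alerts` list whose entries have `status`, `labels` and `annotations`. The `format_alerts` helper below is a hypothetical example of turning that payload into short notification lines; the alert name and summary in the usage note are made up.

```python
import json
from typing import Any, Dict, List


def format_alerts(payload: Dict[str, Any]) -> List[str]:
    """Turn an Alertmanager-style webhook payload into one message per alert."""
    messages: List[str] = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = str(alert.get("status", "unknown")).upper()
        name = labels.get("alertname", "unnamed")
        summary = annotations.get("summary", "")
        # rstrip drops the trailing space when an alert has no summary
        messages.append(f"[{status}] {name}: {summary}".rstrip())
    return messages


def parse_webhook_body(raw: bytes) -> List[str]:
    """Decode a raw HTTP request body and format the alerts it contains."""
    return format_alerts(json.loads(raw.decode("utf-8")))
```

For a firing alert named `NasCapacityHigh` with the summary `NAS usage above 80%`, the helper returns a single line that can be forwarded to whatever chat or paging tool receives the webhook.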

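This way, inspection combined with alerting surfaces problems early, so they can be fixed while still in the bud. It is also an indirect way to cut costs and improve efficiency!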
Posted in: Monitoring
Within the last two days