Server Health Monitoring

If Moltbot runs on your VPS, you already have an AI agent sitting on the server. Put it to work as a lightweight monitoring system that checks disk usage, memory, and error logs — and only notifies you when something is wrong.

This is not a replacement for Prometheus or Datadog. It is a simple, zero-infrastructure alerting layer that requires nothing more than the Moltbot you already have running.

Prerequisites

  • Moltbot running on a VPS or server — This recipe requires Moltbot to have local access to the server it is monitoring
  • Shell or command execution MCP tool — Moltbot needs the ability to run system commands like df, free, and journalctl
  • Scheduled tasks enabled — See Scheduled Tasks

How It Works

A cron job triggers Moltbot at regular intervals. The prompt instructs it to run system commands, interpret the output, and decide whether the results warrant a notification. The critical design principle is: alert on anomalies only, no noise.

If disk usage is at 45% and there are no errors in the logs, you hear nothing. If disk usage spikes to 85% or a critical error appears, you get a Telegram message immediately.

Setup

Step 1: Basic Health Check

Start with a straightforward configuration:

```yaml
cron:
  - name: server-health
    schedule: "0 */6 * * *"
    channel: telegram
    prompt: |
      Run these checks:
      1. Disk usage (df -h)
      2. Memory usage (free -m)
      3. Recent error logs (journalctl -p err --since "6 hours ago")
      If disk usage exceeds 80% or there are critical errors, notify me.
      Otherwise, send nothing.
```

This runs every 6 hours. The "send nothing" instruction is essential — without it, you would receive four "everything is fine" messages per day.
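The "alert only above threshold" logic the prompt asks the agent to apply can be sketched as a plain shell pipeline. This is an illustration, not Moltbot's actual implementation; the `df -h` output is hardcoded so the script is self-contained, but on a real server you would pipe `df -h` in directly.

```shell
#!/bin/sh
# Sample `df -h` output (hypothetical values) standing in for a live run.
df_output='Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        80G   68G   12G  85% /
/dev/vdb1       200G   90G  110G  45% /data'

# Skip the header row, strip the % sign, flag anything above 80.
alerts=$(printf '%s\n' "$df_output" |
  awk 'NR > 1 { gsub(/%/, "", $5); if ($5 + 0 > 80) print $6 " at " $5 "%" }')

[ -n "$alerts" ] && echo "ALERT: $alerts"
# prints: ALERT: / at 85%
```

The partition at 45% produces no output at all, which is exactly the "send nothing" behavior the prompt encodes.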

Step 2: Add More Checks

Expand the health check to cover additional concerns:

```yaml
cron:
  - name: server-health-extended
    schedule: "0 */6 * * *"
    channel: telegram
    prompt: |
      Run the following server health checks:

      1. Disk usage (df -h) — alert if any partition exceeds 80%
      2. Memory usage (free -m) — alert if available memory is below 500MB
      3. CPU load average (uptime) — alert if 15-min average exceeds 4.0
      4. Docker containers (docker ps -a) — alert if any container has exited or is restarting
      5. Recent errors (journalctl -p err --since "6 hours ago") — summarize if any found
      6. SSL certificate expiry (echo | openssl s_client -connect mydomain.com:443 2>/dev/null | openssl x509 -noout -dates) — alert if expiring within 14 days
      For each issue found, include:
      - What the problem is
      - The actual value vs. the threshold
      - A suggested action

      If everything is healthy, send nothing.
```
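Check 6 hides a small date calculation. A hedged sketch of that arithmetic, with a hardcoded `notAfter` date instead of the live `openssl x509 -noout -dates` pipeline (GNU `date -d` assumed, as on most Linux servers; BSD/macOS `date` needs different flags):

```shell
#!/bin/sh
# Example notAfter value in the format openssl prints; in practice it
# comes from: openssl x509 -noout -dates
not_after="Jun  1 12:00:00 2030 GMT"

now=$(date +%s)
expiry=$(date -d "$not_after" +%s)
days_left=$(( (expiry - now) / 86400 ))

if [ "$days_left" -lt 14 ]; then
  echo "ALERT: certificate expires in $days_left days"
else
  echo "OK: $days_left days remaining"
fi
```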

Step 3: Process-Specific Monitoring

Monitor critical services that must stay running:

```yaml
cron:
  - name: service-watchdog
    schedule: "*/15 * * * *"
    channel: telegram
    prompt: |
      Check if these services are running:
      - nginx (systemctl is-active nginx)
      - postgresql (systemctl is-active postgresql)
      - moltbot itself (docker ps | grep moltbot)
      - redis (systemctl is-active redis)

      If any service is down, notify me immediately with the service name
      and the output of its status command.
      If all services are running, send nothing.
```
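The loop the watchdog prompt implies can be sketched as follows. Here `check` is a stub standing in for `systemctl is-active <service>` so the script runs anywhere; on a real server you would replace its body with the actual call.

```shell
#!/bin/sh
# Stubbed status check: pretends nginx/postgresql/redis are up and
# everything else (here, moltbot) is down.
check() {
  case "$1" in
    nginx|postgresql|redis) echo active ;;
    *) echo inactive ;;
  esac
}

down=""
for svc in nginx postgresql redis moltbot; do
  [ "$(check "$svc")" = "active" ] || down="$down $svc"
done

[ -n "$down" ] && echo "DOWN:$down"
# prints: DOWN: moltbot
```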

Step 4: Log Analysis

Go beyond simple error detection with AI-powered log analysis:

```yaml
cron:
  - name: log-analysis
    schedule: "0 8 * * *"
    channel: telegram
    prompt: |
      Analyze the last 24 hours of logs:
      1. Run: journalctl --since "24 hours ago" -p warning
      2. Run: tail -100 /var/log/nginx/error.log
      3. Run: docker logs moltbot --since 24h 2>&1 | tail -50

      Look for:
      - Repeated errors (same error appearing many times)
      - New errors not seen before
      - Patterns that suggest an emerging problem

      Provide a brief analysis. If nothing notable, send nothing.
```
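The "repeated errors" heuristic amounts to counting distinct lines and surfacing anything above a threshold. A minimal sketch with inlined sample lines (hypothetical; in practice they come from the journalctl or docker pipelines above):

```shell
#!/bin/sh
# Four sample log lines, one error repeated three times.
logs='connection refused to db:5432
timeout fetching /api/users
connection refused to db:5432
connection refused to db:5432'

# sort + uniq -c counts duplicates; awk keeps lines seen 3+ times and
# strips the leading count column.
repeated=$(printf '%s\n' "$logs" | sort | uniq -c |
  awk '$1 >= 3 { $1 = ""; sub(/^ +/, ""); print }')

echo "$repeated"
# prints: connection refused to db:5432
```

An AI agent can go further than this (spotting *new* errors, not just frequent ones), but the counting baseline is useful for sanity-checking its reports.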

Step 5: Resource Trend Tracking (Optional)

Use memory to track trends over time:

```yaml
cron:
  - name: resource-snapshot
    schedule: "0 */6 * * *"
    channel: telegram
    prompt: |
      Record current resource usage to memory:
      - Disk usage percentage for /
      - Memory usage percentage
      - Number of running Docker containers

      Compare with the previous snapshot in memory.
      If disk usage grew by more than 5% since last check, alert me.
      If memory usage has been above 80% for 3 consecutive checks, alert me.
      Otherwise, just save the snapshot silently (do NOT send a message).
```

This gives you trend-based alerting, not just point-in-time checks.
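The growth rule reduces to comparing two numbers. A sketch assuming snapshots are stored as plain disk-use percentages; both values below are hypothetical (`prev` would come from memory, `curr` from the current `df` run):

```shell
#!/bin/sh
prev=62   # hypothetical: percentage from the previous snapshot
curr=69   # hypothetical: percentage from the current check
delta=$(( curr - prev ))

if [ "$delta" -gt 5 ]; then
  echo "ALERT: disk grew ${delta}% since last check"
else
  echo "snapshot saved silently"
fi
# prints: ALERT: disk grew 7% since last check
```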

Edge Cases and Troubleshooting

  • Permission issues: Some commands (e.g., journalctl, docker ps) require specific permissions. Make sure the user or container running Moltbot has the necessary access. Running Moltbot as root is not recommended; instead, add the user to the docker and systemd-journal groups.
  • Command availability: Not all servers have the same tools installed. journalctl is systemd-specific; Alpine-based containers use syslog. Adjust commands for your environment.
  • False positives: A brief CPU spike during a backup might trigger an alert. Tune thresholds to avoid noise: use 15-minute load averages instead of 1-minute, and set disk thresholds appropriate for your server's capacity.
  • Alert fatigue: If a known issue causes repeated alerts (e.g., a disk that is always at 82%), either fix the underlying issue or adjust the threshold temporarily: "Ignore disk usage on /data until I expand the volume next week."
  • Time-based noise: Log files often have scheduled spikes (e.g., cron jobs that produce errors at specific times). If you notice patterns, add exceptions to the prompt: "Ignore the 'backup rotation' warnings from logrotate."
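The 15-minute load advice above can be sketched concretely. The sample line mimics `/proc/loadavg` (Linux-specific); field three is the 15-minute average, which a short backup spike barely moves. On a real server you would read the actual file instead of the hardcoded string.

```shell
#!/bin/sh
# Sample /proc/loadavg line; real usage: loadavg=$(cat /proc/loadavg)
loadavg="0.42 1.87 4.31 2/210 8841"
set -- $loadavg
fifteen=$3   # 15-minute average

# awk handles the floating-point comparison that [ ] cannot.
if awk -v l="$fifteen" 'BEGIN { exit !(l > 4.0) }'; then
  echo "ALERT: 15-min load is $fifteen"
else
  echo "load ok ($fifteen)"
fi
# prints: ALERT: 15-min load is 4.31
```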

Pro Tips

  • Use memory for incident history. When an alert fires, Moltbot can save it to memory. Later, you can ask: "Show me all server alerts from the past month. Is there a pattern?" This is lightweight incident tracking without any additional tools.
  • Combine checks intelligently. Instead of separate cron jobs for each check, consolidate into one comprehensive health check. This reduces cron entries and ensures a single, coherent alert message when multiple things go wrong simultaneously.
  • Pair with a real monitoring stack. If you use Prometheus/Grafana for dashboards, Moltbot can complement them as a notification layer. It adds AI interpretation — instead of "disk at 83%," you get "disk at 83%, growing 2% per day, you have approximately 8 days before it fills up."
  • Monitor external services too. Add URL health checks: "Curl https://myapp.com/health and alert if the response is not 200 or if response time exceeds 5 seconds."
  • Set up an escalation chain. "If the same alert fires twice in a row (2 consecutive checks), escalate by including 'URGENT' in the message title."
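The external URL check from the tips above maps onto curl's write-out variables. A sketch of the parsing logic, with a sample failure response hardcoded so it runs without network access; the real curl invocation is shown in the comment, and the URL would be yours:

```shell
#!/bin/sh
# Real check (hypothetical URL):
#   result=$(curl -s -o /dev/null -w '%{http_code} %{time_total}' \
#            --max-time 5 https://myapp.com/health)
result="503 0.412"   # sample: status code, then total time in seconds

status=${result%% *}
seconds=${result#* }

if [ "$status" != "200" ]; then
  echo "ALERT: health endpoint returned $status"
else
  echo "OK in ${seconds}s"
fi
# prints: ALERT: health endpoint returned 503
```

`--max-time 5` doubles as the response-time threshold: a slow endpoint makes curl fail outright rather than report a long duration.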

Community tutorial site — not affiliated with official Moltbot