Server Health Monitoring
If Moltbot runs on your VPS, you already have an AI agent sitting on the server. Put it to work as a lightweight monitoring system that checks disk usage, memory, and error logs — and only notifies you when something is wrong.
This is not a replacement for Prometheus or Datadog. It is a simple, zero-infrastructure alerting layer that requires nothing more than the Moltbot you already have running.
Prerequisites
- Moltbot running on a VPS or server — This recipe requires Moltbot to have local access to the server it is monitoring
- Shell or command execution MCP tool — Moltbot needs the ability to run system commands like `df`, `free`, and `journalctl` (a quick availability check follows this list)
- Scheduled tasks enabled — See Scheduled Tasks
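If you are unsure whether the environment Moltbot runs in actually has these commands, a quick check like the following (assuming a POSIX shell is available to the same user or container) will tell you which ones are missing:

```bash
# Verify the tools used in the recipes below are installed and on PATH
# for the user (or container) Moltbot runs as; adjust the list as needed.
for cmd in df free uptime journalctl docker openssl curl; do
  command -v "$cmd" >/dev/null || echo "missing: $cmd"
done
```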
How It Works
A cron job triggers Moltbot at regular intervals. The prompt instructs it to run system commands, interpret the output, and decide whether the results warrant a notification. The critical design principle is: alert on anomalies only, no noise.
If disk usage is at 45% and there are no errors in the logs, you hear nothing. If disk usage spikes to 85% or a critical error appears, you get a Telegram message immediately.
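To make that concrete, here is a plain-shell sketch of the checks the Step 1 prompt asks for and the kind of threshold decision involved. The 80% figure matches the Step 1 example, and the last two lines are only an illustration of the logic — in practice Moltbot reads the raw command output and makes the call itself rather than running a script like this:

```bash
# The raw checks from the Step 1 prompt (run them manually to see what Moltbot sees):
df -h                                     # disk usage per partition
free -m                                   # memory usage in MB
journalctl -p err --since "6 hours ago"   # errors since the last run

# Illustration only: roughly the decision the prompt delegates to the model.
usage=$(df --output=pcent / | tail -1 | tr -dc '0-9')
[ "$usage" -gt 80 ] && echo "ALERT: root partition at ${usage}%"
```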
Setup
Step 1: Basic Health Check
Start with a straightforward configuration:
```yaml
cron:
  - name: server-health
    schedule: "0 */6 * * *"
    channel: telegram
    prompt: |
      Run these checks:
      1. Disk usage (df -h)
      2. Memory usage (free -m)
      3. Recent error logs (journalctl -p err --since "6 hours ago")

      If disk usage exceeds 80% or there are critical errors, notify me.
      Otherwise, send nothing.
```

This runs every 6 hours. The "send nothing" instruction is essential — without it, you would receive four "everything is fine" messages per day.
Step 2: Add More Checks
Expand the health check to cover additional concerns:
```yaml
cron:
  - name: server-health-extended
    schedule: "0 */6 * * *"
    channel: telegram
    prompt: |
      Run the following server health checks:
      1. Disk usage (df -h) — alert if any partition exceeds 80%
      2. Memory usage (free -m) — alert if available memory is below 500MB
      3. CPU load average (uptime) — alert if 15-min average exceeds 4.0
      4. Docker containers (docker ps -a) — alert if any container has exited or is restarting
      5. Recent errors (journalctl -p err --since "6 hours ago") — summarize if any found
      6. SSL certificate expiry (echo | openssl s_client -connect mydomain.com:443 2>/dev/null | openssl x509 -noout -dates) — alert if expiring within 14 days

      For each issue found, include:
      - What the problem is
      - The actual value vs. the threshold
      - A suggested action

      If everything is healthy, send nothing.
```

Step 3: Process-Specific Monitoring
Monitor critical services that must stay running:
```yaml
cron:
  - name: service-watchdog
    schedule: "*/15 * * * *"
    channel: telegram
    prompt: |
      Check if these services are running:
      - nginx (systemctl is-active nginx)
      - postgresql (systemctl is-active postgresql)
      - moltbot itself (docker ps | grep moltbot)
      - redis (systemctl is-active redis)

      If any service is down, notify me immediately with the service name
      and the output of its status command.
      If all services are running, send nothing.
```

Step 4: Log Analysis
Go beyond simple error detection with AI-powered log analysis:
```yaml
cron:
  - name: log-analysis
    schedule: "0 8 * * *"
    channel: telegram
    prompt: |
      Analyze the last 24 hours of logs:
      1. Run: journalctl --since "24 hours ago" -p warning
      2. Run: tail -100 /var/log/nginx/error.log
      3. Run: docker logs moltbot --since 24h 2>&1 | tail -50

      Look for:
      - Repeated errors (same error appearing many times)
      - New errors not seen before
      - Patterns that suggest an emerging problem

      Provide a brief analysis. If nothing notable, send nothing.
```

Step 5: Resource Trend Tracking (Optional)
Use memory to track trends over time:
```yaml
cron:
  - name: resource-snapshot
    schedule: "0 */6 * * *"
    channel: telegram
    prompt: |
      Record current resource usage to memory:
      - Disk usage percentage for /
      - Memory usage percentage
      - Number of running Docker containers

      Compare with the previous snapshot in memory.
      If disk usage grew by more than 5% since last check, alert me.
      If memory usage has been above 80% for 3 consecutive checks, alert me.
      Otherwise, just save the snapshot silently (do NOT send a message).
```

This gives you trend-based alerting, not just point-in-time checks.
Edge Cases and Troubleshooting
- Permission issues: Some commands (e.g., `journalctl`, `docker ps`) require specific permissions. Make sure the user or container running Moltbot has the necessary access. Running Moltbot as root is not recommended; instead, add the user to the `docker` and `systemd-journal` groups (see the sketch after this list).
- Command availability: Not all servers have the same tools installed. `journalctl` is systemd-specific; Alpine-based containers use `syslog`. Adjust commands for your environment.
- False positives: A brief CPU spike during a backup might trigger an alert. Tune thresholds to avoid noise: use 15-minute load averages instead of 1-minute, and set disk thresholds appropriate for your server's capacity.
- Alert fatigue: If a known issue causes repeated alerts (e.g., a disk that is always at 82%), either fix the underlying issue or adjust the threshold temporarily: "Ignore disk usage on /data until I expand the volume next week."
- Time-based noise: Log files often have scheduled spikes (e.g., cron jobs that produce errors at specific times). If you notice patterns, add exceptions to the prompt: "Ignore the 'backup rotation' warnings from logrotate."
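For the permission issue above, this is one way to grant the needed access without running Moltbot as root, assuming it runs as a dedicated `moltbot` user (swap in whatever account you actually use):

```bash
# Grant journal and Docker access to the Moltbot user without using root.
# "moltbot" is an assumed username; replace it with your own.
sudo usermod -aG systemd-journal,docker moltbot
# Restart the Moltbot process (or log the user out and back in) so the new
# group membership takes effect.
```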
Pro Tips
- Use memory for incident history. When an alert fires, Moltbot can save it to memory. Later, you can ask: "Show me all server alerts from the past month. Is there a pattern?" This is lightweight incident tracking without any additional tools.
- Combine checks intelligently. Instead of separate cron jobs for each check, consolidate into one comprehensive health check. This reduces cron entries and ensures a single, coherent alert message when multiple things go wrong simultaneously.
- Pair with a real monitoring stack. If you use Prometheus/Grafana for dashboards, Moltbot can complement them as a notification layer. It adds AI interpretation — instead of "disk at 83%," you get "disk at 83%, growing 2% per day, you have approximately 8 days before it fills up."
- Monitor external services too. Add URL health checks: "Curl https://myapp.com/health and alert if the response is not 200 or if response time exceeds 5 seconds." A shell equivalent of this check follows this list.
- Set up an escalation chain. "If the same alert fires twice in a row (2 consecutive checks), escalate by including 'URGENT' in the message title."
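For the external URL check mentioned above, this is roughly the command that prompt would have Moltbot run (`https://myapp.com/health` is a placeholder endpoint):

```bash
# Report HTTP status and total response time; give up after 5 seconds.
curl -s -o /dev/null -w "status=%{http_code} time=%{time_total}s\n" \
  --max-time 5 https://myapp.com/health
```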
Related Pages
- Scheduled Tasks — Cron configuration for health checks
- MCP Tools — Shell execution tools
- Memory System — Track resource trends and incident history
- Personalized Daily Briefing — Include server status in your morning briefing
- Creative Use Cases — More automation ideas