1. Overview
What SelfHeal does
SelfHeal connects your monitoring system (for example Prometheus Alertmanager) to your servers via SSH. When an alert fires, SelfHeal:
- Ingests the alert via a webhook or event pipeline.
- Creates or updates an incident in its internal store.
- Decides what to do using:
- deterministic rules you define,
- AI-powered modes (Advisor / Single action / Loop), or
- "no action" if policy, mode or license restricts it.
- Executes or simulates commands/scripts over SSH.
- Records a detailed trail of decisions, commands and AI reasoning, and presents it as a clear incident timeline in the UI.
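As a concrete illustration, the request below hand-delivers a minimal Alertmanager-style payload to SelfHeal. The /webhook path and X-Auth-Token header name are assumptions for this sketch; use the URL and token from your own configuration.

```bash
# Hand-deliver a minimal Alertmanager-style alert to SelfHeal.
# NOTE: the /webhook path and X-Auth-Token header name are assumptions;
# substitute the endpoint and token from your SelfHeal configuration.
curl -X POST "https://selfheal.example.com/webhook" \
  -H "Content-Type: application/json" \
  -H "X-Auth-Token: $SELFHEAL_TOKEN" \
  -d '{
        "status": "firing",
        "alerts": [{
          "labels": {"alertname": "DiskAlmostFull", "instance": "10.0.0.12"},
          "annotations": {"summary": "Disk usage above 90%"}
        }]
      }'
```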
Key design principles
- Agentless: no agents on your hosts, only SSH from the SelfHeal node.
- Safety first: explicit instance allowlist, command allow/block lists, maintenance windows, dry-run and simulation.
- Audit by default: every decision, action and AI step is recorded and surfaced in the UI for review and compliance.
- Simple to run: a single SelfHeal node with an embedded database, and optionally an event stream (for example Kafka) for larger setups.
What you need
- A Linux host to run SelfHeal.
- SSH access from the SelfHeal node to your servers.
- A monitoring system that can send webhooks or events (for example Alertmanager).
- (Optional) A license issued from the Ganges portal to unlock AI modes and higher node limits.
2. Architecture
High-level components
- Ingest & UI service
- Receives alerts from your monitoring system via a webhook endpoint.
- Normalizes alerts and stores a compact incident record.
- Serves the web UI: Dashboard, Inbox, Actions, Rules, Policy, License, Simulate, Incidents.
- Streams live incident updates to the browser so operators see changes in real time.
- Action service
- Consumes normalized alerts from the ingest pipeline.
- Applies mode logic (rules-only, advisory, single action, or loop).
- Runs SSH commands or scripts with multiple layers of guardrails.
- Captures command outputs and AI reasoning for later review.
- State store
- Backed by a lightweight embedded database on the SelfHeal node.
- Stores incidents, actions and AI audit data.
- Designed so operators do not need to manage schema or migrations directly; upgrades take care of internal structure.
- Optional streaming bus
- In larger deployments you can front SelfHeal with a message bus such as Kafka.
- Alerts flow through your stream and into SelfHeal, giving you buffering, replay and decoupling from your monitoring system.
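For example, a deployment that fronts SelfHeal with Kafka might publish alerts onto a topic that SelfHeal is configured to consume. The sketch below uses the stock Kafka console producer; the broker address and the "alerts" topic name are assumptions.

```bash
# Publish a test alert onto a Kafka topic that SelfHeal consumes.
# Broker address and topic name ("alerts") are assumptions for this sketch.
echo '{"alertname": "DiskAlmostFull", "instance": "10.0.0.12"}' | \
  kafka-console-producer.sh --bootstrap-server localhost:9092 --topic alerts
```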
- Ganges license portal
- Runs as a separate portal (this website).
- Issues cryptographically signed license bundles.
- SelfHeal validates licenses locally to enforce node limits and determine which AI capabilities are enabled.
- License checks are offline; alert data does not leave your cluster.
Alert → Action data flow
- Alert fires in your monitoring system.
- Delivery: the monitoring system sends a webhook or event to SelfHeal using the URL and token from your configuration.
- Ingest & normalization:
- SelfHeal parses the payload, extracts key labels and metadata, and records an incident.
- For each alert, SelfHeal creates a normalized internal view that the action engine uses consistently across sources.
- Decision (safety-first):
- Is the target instance allowed?
- Is the instance currently in a maintenance window?
- Do we have a matching rule for this alert?
- Is AI available (mode + license) for this incident?
- Execution or simulation:
- If a rule applies, SelfHeal runs the associated command or script when it passes policy checks.
- If an AI mode is active, SelfHeal asks the AI planner to propose a safe command or sequence, then applies the same guardrails.
- If anything fails a guardrail, SelfHeal records a clear "blocked" reason instead of executing.
- Observability & UX:
- Every action and AI step is recorded with timestamps, host, status and (when applicable) stdout/stderr.
- The incident view shows a single, chronological story of the alert, decisions and actions taken.
3. Safety & Guardrails
Instance allowlist
SelfHeal will only run actions on instances you have explicitly allowed. Configure this under Settings → Allowlist.
- Add entries like 10.0.0.12 or demo-node.
- Enable or disable instances without deleting them.
- Use maintenance windows to temporarily suppress actions but still see alerts.
Command allowlist & blocklist
Every command passes through a layered command policy:
- A built-in safety layer that blocks obviously catastrophic patterns.
- Your explicit command blocklist rules.
- Your explicit command allowlist rules and conservative built-in patterns.
If a command is not clearly allowed, or matches a blocked pattern, it is treated as unsafe and blocked. The reason is visible in both the UI and logs so operators can see exactly why something did not run.
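The exact pattern syntax lives on the Policy page; the sketch below only illustrates the kinds of commands a conservative policy tends to allow or block.

```bash
# Illustration only: typical outcomes under a conservative command policy.
# Narrow, targeted commands like these usually pass the allowlist:
systemctl status nginx
df -h /var
systemctl restart myservice   # "myservice" stands in for a service you trust SelfHeal to restart
# Destructive patterns like these are blocked by the built-in safety layer or your blocklist:
#   rm -rf /
#   mkfs.ext4 /dev/sda
```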
Dry-run and simulation
- Dry-run mode lets SelfHeal go through the full decision pipeline but only record what it would have executed, without actually running commands.
- Simulation mode (from the Simulate page or API) feeds synthetic alerts into SelfHeal. They are processed like real alerts, but simulated incidents never run real commands, regardless of the cluster's dry-run setting.
- Simulated incidents are clearly marked in the UI so they cannot be confused with production activity.
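A simulated alert can also be sent from the command line. The path, header and payload fields below are assumptions for this sketch (the Simulate page prefills the real format); the request behaves like a real alert except that no commands ever run.

```bash
# Send a synthetic alert to the simulation API.
# The /api/simulate path, X-Auth-Token header and payload fields are assumptions;
# the Simulate page in the UI shows the exact payload it posts.
curl -X POST "https://selfheal.example.com/api/simulate" \
  -H "Content-Type: application/json" \
  -H "X-Auth-Token: $SELFHEAL_TOKEN" \
  -d '{"alertname": "DemoAlert", "instance": "demo-node", "mode": "advisor"}'
```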
Webhook & API protections
- Protect the webhook endpoint with an authentication token.
- Restrict access to simulation and admin APIs using your normal controls: HTTPS, basic auth, VPN, IP allowlists, etc.
- In most deployments, SelfHeal sits behind a reverse proxy such as nginx, which terminates TLS and enforces authentication.
Licensing and node caps
SelfHeal reads an offline license file issued from the Ganges portal. This license controls how many nodes you can manage and which AI features are enabled.
- License status and limits are visible under the License page.
- Node caps and AI modes (Advisor, Single action, Loop) are enforced by license.
- Even when a license constraint blocks execution, SelfHeal still records what it would have done so you can see intent and tune settings.
4. Capabilities & AI levels
SelfHeal is structured as a capability ladder. You can adopt it step by step, starting with pure visibility and deterministic rules, then enabling AI when your team is comfortable.
Level 0: Ingest & Observe
- Ingest alerts via webhook or event stream.
- Record alerts as incidents with live updates in the UI.
- Use Dashboard, Inbox and Incident views as a "single pane of glass" for your alerts.
- No automated remediation; observation only.
Level 1: Deterministic rules (no AI)
- Define rules that map specific alerts (and optionally instances) to commands or scripts.
- Run in a "rules-only, observe" mode to preview actions without executing.
- Switch to "rules-only, enforce" mode to automatically run rule-based fixes when they pass safety checks.
- Ideal for simple, well-understood fixes such as restarting a known service.
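For instance, a Level 1 rule for a "service down" alert usually maps to a single, well-understood command along these lines (the service name is a placeholder):

```bash
# The kind of command a deterministic rule typically runs:
# restart a known service, then confirm it is active.
systemctl restart myservice && systemctl is-active myservice
```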
Level 2: AI Advisor
- When no rule is found, AI reads the alert and proposes a diagnosis and possible commands.
- No commands are executed automatically in this mode.
- Operators see suggestions in the incident view and can choose to run them manually in their own shell.
Level 3: AI Single action
- AI proposes a single, focused command to run for an incident.
- Command policy, allowlist and maintenance checks are applied before any execution.
- If approved by policy, SelfHeal executes the command once, captures output and stops.
- Ideal for "restart one service" or "run one diagnostic + fix command" scenarios on trusted hosts.
Level 4: AI Loop (Plan → Execute → Observe → Iterate)
- AI can run a short, bounded sequence of commands with feedback:
- Step 1: run diagnostics (for example, check disk or memory usage).
- Step 2: choose and run an appropriate fix.
- Step 3: verify state, and if healthy, stop.
- Loops are limited by a configurable maximum number of steps to avoid unbounded automation.
- Each step records the command, output and AI reasoning and appears in the incident timeline.
- Best suited for advanced users who want a "self-driving runbook" on trusted lab or staging clusters first.
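On a "disk almost full" alert, a bounded loop might play out roughly as below. The commands are only an illustration of the plan/execute/observe shape, not a built-in runbook.

```bash
# Step 1: diagnose - how full is the volume?
df -h /var
# Step 2: apply a conservative fix - trim journald logs to 500 MB.
journalctl --vacuum-size=500M
# Step 3: verify - re-check usage; if healthy, the loop stops.
df -h /var
```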
You can keep production clusters at Levels 0-2 indefinitely and only enable Levels 3-4 where your team is comfortable with automated changes.
5. UI tour
Dashboard
- High-level service health and connectivity checks.
- Mini "Current incidents" list for quick triage.
- Recent actions and AI activity summaries.
- Real-time updates when new alerts arrive.
Inbox
- Newest alerts and incidents first, with search and filters.
- Live updates as alerts come in, without page reloads.
- Click any row to open the full incident timeline.
- Simulated incidents are clearly labelled for demos and tests.
Incidents
- One page per incident (group of related alerts).
- Timeline merges:
- Alert details and labels.
- Actions taken or attempted.
- AI suggestions, plans and loop steps.
- Shows why something was allowed, blocked, simulated or skipped.
- Ideal for handoffs, RCAs and customer demos.
Actions
- Global list of all actions SelfHeal attempted or executed.
- Filters for success/failure, mode, and dry-run vs real execution.
- Detail view shows command, host, timestamps, stdout and stderr.
- Copy buttons for commands and outputs for quick reuse.
Rules
- UI to create, update and delete rules that map alerts to actions.
- Conventions:
- Leave the instance field empty for a global rule.
- Set instance to an IP/hostname to bind a rule to one server.
- Search, copy and safe deletion with confirmation dialogs.
Policy (Allowlist & Command Policy)
- Allowlist:
- Manage which instances can be touched by SelfHeal.
- View licensed vs. currently used nodes.
- Command Policy:
- Define patterns for allowed and blocked commands.
- Review and refine patterns as you see real actions in the system.
Catalog
- Lists built-in "verbs" such as diagnostic helpers and common remediation patterns.
- Optional preview panel to see what a verb intends to do before it runs.
- Useful for safety reviews and explaining SelfHeal's building blocks to new team members.
License
- Shows current license status, node caps and expiry.
- Upload and activate license bundles issued from the Ganges portal.
- Provides a detailed view for support and debugging.
Simulate
- Central place to send test alerts into SelfHeal without touching your real monitoring system.
- JSON editor prefilled with a realistic alert payload.
- Mode dropdown to simulate rules-only, advisory, single action or loop behaviour.
- Simulated incidents never run real commands, regardless of cluster mode.
- Perfect for demos, regression tests, and verifying rules or policies.
6. Install & runtime layout
Installation model
SelfHeal is distributed as a single Linux tarball with an installer script. The experience is designed to feel familiar to operators used to CNCF-style tools.
- Download selfheal-1.0.9.tar.gz from the Ganges portal.
- Extract the archive and run the install script as root.
- The installer:
- Places SelfHeal code in a dedicated installation directory.
- Creates a configuration directory for configuration and license files.
- Sets up system services for the UI/ingest and the action engine.
- Optionally installs and configures nginx as a reverse proxy with HTTPS and basic auth.
For copy-paste commands and concrete examples, refer to the Quickstart page on this portal.
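The overall shape of the flow looks roughly like this; the extracted directory and installer script names are assumptions, and the Quickstart page remains the authoritative source for the real commands.

```bash
# Rough shape of the install flow (names other than the tarball are assumptions).
tar -xzf selfheal-1.0.9.tar.gz
cd selfheal-1.0.9      # extracted directory name assumed
sudo ./install.sh      # installer script name assumed; see the Quickstart for the real command
```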
What you get after install
- Two system services:
- one for the web UI and alert ingest,
- one for the action engine.
- A main configuration file where you set:
- cluster mode (rules-only, advisory, single action, loop),
- authentication token for incoming webhooks,
- any non-default ports or paths,
- optional integration with an event stream (like Kafka).
- A place to drop your license file issued from the Ganges portal.
- Log output integrated with your Linux logging system (for example journald), plus structured action and incident history available in the UI.
- An HTTP endpoint (typically behind nginx) where operators access the UI with HTTPS and your chosen authentication method.
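You can check both services with your usual systemd tooling. The unit names below are assumptions; use whatever names the installer created on your host.

```bash
# Unit names are assumptions; substitute the ones the installer created.
systemctl status selfheal-ui.service selfheal-action.service
```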
The goal is to keep "day 1" simple: one node, one installer, and a small number of configuration knobs to get from zero to first incident quickly.
7. Typical workflows
Onboarding a new SelfHeal node
- Install SelfHeal on a small Linux VM or bare-metal host.
- Open the main configuration file and set:
- an authentication token for incoming alerts,
- the initial cluster mode (for example rules-only + Advisor),
- any HTTPS or proxy-related settings recommended by your organisation.
- Add one or two lab servers to the Allowlist page.
- Create a simple rule for a test alert (for example a demo alert that runs a harmless echo command).
- Use the Simulate page to send a demo alert.
- Verify Dashboard, Inbox, Actions and Incidents all show the expected story before wiring in your real monitoring system.
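Before sending the demo alert, it is worth confirming that the SelfHeal node can actually reach the lab host over SSH, for example:

```bash
# Quick SSH reachability check from the SelfHeal node to an allowlisted lab host.
# "demo-node" is the hostname you added to the allowlist.
ssh demo-node 'echo selfheal-connectivity-test'
```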
Investigating a real incident
- Open Dashboard and click the incident in "Current incidents".
- Review the Incident timeline: alert, actions and AI suggestions (if any).
- Check whether the action was:
- Executed successfully,
- Skipped because the cluster is in dry-run mode,
- Blocked by allowlist or command policy.
- Use stdout/stderr shown in the action details to verify the impact on the remote host.
Adding a new rule from a real alert
- Wait for the real alert to fire and show up in Inbox.
- Open the incident and review labels such as alert name and instance.
- Open Rules and add a new rule that matches those labels and runs your desired command or script.
- Test in a lab environment or with simulated alerts before relying on the rule in production.
Rolling out AI modes safely
- Keep production clusters in Advisor mode initially so AI only suggests actions.
- Enable Single action only for a small set of low-risk alerts on trusted hosts (for example simple service restarts in a lab).
- Monitor actions and AI explanations to build trust with your team.
- Introduce Loop mode later, starting with staging or non-critical clusters, and with conservative command policies.
8. Operations & troubleshooting
Health checks
- SelfHeal exposes a basic liveness endpoint you can scrape from your monitoring system.
- A separate readiness endpoint reports when the database and core services are ready.
- The exact paths and examples are documented in the Quickstart and operator guides.
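As a sketch, scraping the endpoints could look like the following; the port and paths are placeholders, the real URLs are documented in the Quickstart and operator guides.

```bash
# Liveness and readiness probes (port and paths are placeholders, not the documented values).
curl -fsS http://localhost:8080/healthz   # liveness: process is up
curl -fsS http://localhost:8080/readyz    # readiness: database and core services are ready
```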
Common "nothing happened" causes
- The target instance is not on the allowlist.
- The proposed command is blocked by command policy.
- The cluster is in dry-run mode.
- The alert did not match any rule and AI modes are either disabled or restricted by license.
Debugging steps
- Check Inbox for the alert. If it is not there, webhook routing or authentication is likely wrong.
- Open the Incident view for that alert and read the timeline.
- Look at the action rows:
- If an action is marked as Blocked, read the policy reason and adjust allowlist or command policy if appropriate.
- If actions are marked as Dry-run, switch the cluster to full execution once you are confident.
- If there are no action rows: SelfHeal either stayed in observation-only mode or had no rule/AI path available for this incident.
- Check the License page for license validity and node caps.
- Use your system logs (for example journalctl) for deeper diagnostics if the services themselves are unhealthy.
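For example, to pull the last hour of logs for both services (unit names are assumptions; substitute the ones the installer created):

```bash
# Recent logs for the UI/ingest and action services; unit names are assumptions.
journalctl -u selfheal-ui -u selfheal-action --since "1 hour ago"
```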
9. FAQ
Is SelfHeal agentless?
Yes. SelfHeal connects to your servers over SSH and does not require any agent daemons on the target machines.
Which OS and workloads are supported?
SelfHeal v1.0.9 focuses on Linux servers and services that can be managed via SSH commands and scripts. Containers, Kubernetes and other platforms can be managed via their CLI tools (for example kubectl) as long as those tools are available on the SelfHeal node.
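As an illustration, a rule or AI action could invoke kubectl installed on the SelfHeal node; the namespace and deployment below are placeholders.

```bash
# Example of managing Kubernetes from the SelfHeal node via kubectl.
# Namespace and deployment names are placeholders.
kubectl -n demo rollout restart deployment/demo-app
```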
Can I use AI modes without letting SelfHeal run commands?
Yes. Advisor mode provides AI suggestions only. You can stay in this mode indefinitely and still get value from analysis and recommendations without any automated execution.
How do I keep my secrets safe?
Use your standard Linux practices for SSH keys and other secrets (file permissions, key management, vaults, etc.). SelfHeal reads configuration and license files from its own configuration directory but does not attempt to bypass or replace normal OS security. Root access on the SelfHeal node remains the trust boundary.
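For example, the usual key hygiene applies on the SelfHeal node; a dedicated key like the one below is a suggestion, not a requirement.

```bash
# Standard SSH key hygiene on the SelfHeal node (nothing SelfHeal-specific).
ssh-keygen -t ed25519 -f ~/.ssh/selfheal_ed25519 -C "selfheal"   # dedicated key; name is a suggestion
chmod 700 ~/.ssh
chmod 600 ~/.ssh/selfheal_ed25519
```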
Where do I start?
Start with a small lab cluster, one or two alerts, and Level 1 rules plus Advisor mode. Once you're comfortable with the guardrails and incident timelines, expand to more alerts and consider enabling AI Single action on low-risk scenarios.