Redundancy groups¶

A redundancy group protects one set of resources with two or more UPS sources. Eneru shuts the group down only when the configured quorum is lost.

Use this for dual-PSU servers, A+B rack feeds, and other setups where losing one UPS does not mean the protected system has lost power.

When to use one¶

Situation	Use a redundancy group?
One UPS feeds one server	No. Use a normal UPS group
Two UPSes feed two independent racks	No. Use multi-UPS groups
Two UPSes feed both PSUs on the same server	Yes
Two UPSes feed a shared A+B rack	Yes

The practical test is simple: if a resource can survive one UPS failure, put it in a redundancy group instead of assigning it directly to a single UPS group.

Example¶

ups:
  - name: "UPS-A@10.0.0.10"
  - name: "UPS-B@10.0.0.11"

redundancy_groups:
  - name: "rack-1-dual-psu"
    ups_sources:
      - "UPS-A@10.0.0.10"
      - "UPS-B@10.0.0.11"
    min_healthy: 1
    degraded_counts_as: healthy
    unknown_counts_as: critical
    remote_servers:
      - name: "Compute Node 1"
        enabled: true
        host: "10.0.0.20"
        user: "root"

With two UPSes and min_healthy: 1, the group tolerates one failed UPS. It shuts down only when the healthy count drops below 1.

Fields¶

Key	Default	Description
`name`	required	Unique group label
`ups_sources`	required	Two or more UPS names from the top-level `ups:` list
`min_healthy`	`1`	Shutdown fires when the healthy member count drops below this number. Must be at least `1`
`degraded_counts_as`	`healthy`	Count DEGRADED members as `healthy` or `critical`
`unknown_counts_as`	`critical`	Count UNKNOWN members as `critical`, `degraded`, or `healthy`
`is_local`	`false`	This group powers the Eneru host and may own local resources
`triggers`	inherits	Trigger overrides for this group
`remote_servers`	`[]`	Remote resources owned by the group
`virtual_machines`, `containers`, `filesystems`	disabled	Local resources. Valid only when `is_local: true`

Quorum¶

The group fires when:

healthy_count < min_healthy

For a two-UPS dual-PSU server:

`min_healthy`	Behavior
`1`	Shut down when both UPSes fail. This is the usual choice
`2`	Shut down when either UPS fails. This removes practical redundancy

For a three-UPS group:

`min_healthy`	Behavior
`1`	Tolerate two failed members
`2`	Tolerate one failed member
`3`	Shut down when any member fails

min_healthy: 0 is invalid because the group would never shut down.

Member states¶

Each UPS member is classified on every evaluator tick.

State	Meaning
`HEALTHY`	UPS reports usable data and no active problem
`DEGRADED`	UPS is visible but in a warning state, such as on battery or voltage warning
`CRITICAL`	UPS hit a shutdown trigger, FSD, overload-critical path, or explicit advisory condition
`UNKNOWN`	Snapshot is stale, NUT connection is lost, or the monitor cannot provide current data

degraded_counts_as controls whether warning states still contribute to quorum. unknown_counts_as controls how missing data is counted. The default is tolerant of degraded power but fail-safe on missing visibility.
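
The counting rule can be sketched in Python. This is a minimal illustration of the documented behavior, not Eneru's actual code: the `State` enum and the `effective_state` and `quorum_lost` names are hypothetical. It also assumes that an UNKNOWN member counted as `degraded` is then resolved through `degraded_counts_as`, which this page does not state explicitly.

```python
from enum import Enum


class State(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    CRITICAL = "critical"
    UNKNOWN = "unknown"


def effective_state(state, degraded_counts_as="healthy", unknown_counts_as="critical"):
    """Map a raw member state to the state used for quorum counting."""
    if state is State.UNKNOWN:
        # Assumption: "degraded" here falls through to degraded_counts_as below.
        state = State(unknown_counts_as)
    if state is State.DEGRADED:
        state = State(degraded_counts_as)
    return state


def quorum_lost(states, min_healthy=1, **policy):
    """True when healthy_count < min_healthy, i.e. the group should fire."""
    healthy = sum(effective_state(s, **policy) is State.HEALTHY for s in states)
    return healthy < min_healthy


# Defaults: an on-battery member still holds quorum, a missing one does not.
print(quorum_lost([State.DEGRADED, State.CRITICAL]))  # False
print(quorum_lost([State.UNKNOWN, State.CRITICAL]))   # True
```

With the defaults, a member that is merely on battery keeps the group alive, while a member with no current data is treated as gone.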

Advisory triggers¶

Member UPS triggers still run. In a redundancy group they do not directly run the shutdown sequence. They mark the member as advisory-critical, then the group evaluator decides whether quorum is gone.

Group-level triggers are also evaluated by the redundancy evaluator itself. Use them when a shared resource has a different risk budget than the member UPS defaults: a group trigger can mark a member critical for quorum without changing that UPS monitor's own config.

depletion.critical_rate and depletion.grace_period can be group-local. depletion.window cannot: the rolling depletion rate is calculated by each UPS monitor before the redundancy evaluator sees the snapshot. Configure depletion.window globally or on ups[*].triggers instead.

You will see log lines like:

Trigger condition met (advisory, redundancy group): battery below threshold

For min_healthy: 1, that advisory condition only shuts down the protected resource if every other member has also stopped counting as healthy.

Member UPS authority¶

Once a UPS is named in a redundancy group's ups_sources, it becomes fully advisory. Any remote_servers, virtual_machines, containers, or filesystems still configured under that UPS group are no longer shut down by that UPS's own triggers. Only group quorum loss (or drain_on_local_shutdown on the local group) drains them. This is easy to miss, because an operator reasonably expects per-UPS protection to keep working. Eneru therefore emits a validation WARNING when a redundancy member still carries its own shutdown resources, so the loss of per-UPS authority is explicit. If you want those resources protected by quorum, move them to the redundancy group: remote servers into its remote_servers list, other resources under the group's ownership.

Local ownership¶

At most one group across the whole config, counting both UPS groups and redundancy groups, can set is_local: true. That group may own local VMs, containers, filesystems, and local shutdown behavior.

This is valid:

redundancy_groups:
  - name: "local-dual-feed"
    is_local: true
    virtual_machines:
      enabled: true

The same config is rejected if any other UPS group or redundancy group already sets is_local: true.

Remote-server ownership¶

A remote server, identified by its host and user pair, can be owned in only one place, either:

One UPS group's remote_servers list.
One redundancy group's remote_servers list.

Validation rejects duplicate ownership so Eneru cannot shut down the same server through two paths.
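
The ownership rule amounts to a uniqueness check on `(host, user)` pairs. The sketch below is illustrative only: `find_duplicate_owners` is a hypothetical helper, and it assumes the YAML has been parsed into a dict where both top-level `ups` entries and `redundancy_groups` entries may carry a `remote_servers` list.

```python
from collections import defaultdict


def find_duplicate_owners(config):
    """Return {(host, user): [owners]} for servers claimed more than once."""
    owners = defaultdict(list)
    for section in ("ups", "redundancy_groups"):
        for group in config.get(section, []):
            for server in group.get("remote_servers", []):
                key = (server["host"], server["user"])
                owners[key].append(f"{section}:{group['name']}")
    # Keep only the pairs that appear under more than one owner.
    return {key: names for key, names in owners.items() if len(names) > 1}


config = {
    "ups": [
        {"name": "UPS-A@10.0.0.10",
         "remote_servers": [{"host": "10.0.0.20", "user": "root"}]},
    ],
    "redundancy_groups": [
        {"name": "rack-1-dual-psu",
         "remote_servers": [{"host": "10.0.0.20", "user": "root"}]},
    ],
}
print(find_duplicate_owners(config))
```

Here the same server is claimed by a UPS group and a redundancy group, so the pair is reported with both owners.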

Validate¶

sudo eneru validate --config /etc/ups-monitor/config.yaml

Validation prints configured redundancy groups:

Redundancy groups (1):
  1. rack-1-dual-psu
     Sources (2): UPS-A@10.0.0.10, UPS-B@10.0.0.11
     Quorum: min_healthy=1 (degraded->healthy, unknown->critical)
     Remote servers (1): Compute Node 1

Common validation failures:

Error class	Cause
Unknown UPS source	`ups_sources` does not exactly match a top-level `ups[].name`
Duplicate UPS source	Same member listed twice
Duplicate group name	Two redundancy groups share a name
Multiple local groups	More than one UPS or redundancy group has `is_local: true`
Duplicate remote ownership	Same `host` and `user` assigned to more than one group
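
The first four error classes can be pictured as simple checks over the parsed config. This is a rough sketch, not Eneru's validator: `check_redundancy_groups` and its message strings are hypothetical, and it assumes `is_local` sits directly on each `ups` and `redundancy_groups` entry.

```python
def check_redundancy_groups(config):
    """Return a list of problems; the message wording is illustrative only."""
    errors = []
    ups_names = {ups["name"] for ups in config.get("ups", [])}
    groups = config.get("redundancy_groups", [])

    seen_names = set()
    for group in groups:
        name = group["name"]
        if name in seen_names:
            errors.append(f"Duplicate group name: {name}")
        seen_names.add(name)

        seen_sources = set()
        for source in group.get("ups_sources", []):
            # Names must match a top-level ups[].name exactly (case-sensitive).
            if source not in ups_names:
                errors.append(f"Unknown UPS source in {name}: {source}")
            if source in seen_sources:
                errors.append(f"Duplicate UPS source in {name}: {source}")
            seen_sources.add(source)

    # At most one UPS group or redundancy group may be local.
    local = [g["name"] for g in config.get("ups", []) + groups if g.get("is_local")]
    if len(local) > 1:
        errors.append(f"Multiple local groups: {', '.join(local)}")
    return errors
```

An empty list means none of these four classes applies. Duplicate remote ownership is a separate check on `host` and `user` pairs.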

Failure timeline¶

For a dual-UPS group with min_healthy: 1 and default counting:

Time	Event	Group result
0s	UPS-A loses input power	UPS-A is `DEGRADED`, UPS-B is `HEALTHY`. Quorum holds
60s	UPS-A hits low battery	UPS-A is `CRITICAL`, UPS-B is `HEALTHY`. Quorum still holds
90s	UPS-B also loses input power	UPS-B is `DEGRADED` and counts as healthy by default. Quorum holds
120s	UPS-B hits low battery	Both members are `CRITICAL`. Quorum is lost and shutdown starts

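By default a degraded member still counts toward quorum. To stop counting degraded members, set degraded_counts_as: critical. In the timeline above, the group would then lose quorum at 90s, when UPS-B goes on battery and no member counts as healthy.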
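
The timeline can be replayed in a few lines of Python to see how the counting policy moves the shutdown point. This is a sketch under the quorum rule `healthy_count < min_healthy`. The `TIMELINE` data mirrors the table above, and `first_quorum_loss` is a hypothetical helper, not part of Eneru.

```python
# (seconds, member states) taken from the failure timeline table.
TIMELINE = [
    (0,   {"UPS-A": "DEGRADED", "UPS-B": "HEALTHY"}),
    (60,  {"UPS-A": "CRITICAL", "UPS-B": "HEALTHY"}),
    (90,  {"UPS-A": "CRITICAL", "UPS-B": "DEGRADED"}),
    (120, {"UPS-A": "CRITICAL", "UPS-B": "CRITICAL"}),
]


def first_quorum_loss(timeline, min_healthy=1, degraded_counts_as="healthy"):
    """Return the first timestamp at which quorum is lost, or None."""
    counted = {"HEALTHY"}
    if degraded_counts_as == "healthy":
        counted.add("DEGRADED")
    for seconds, states in timeline:
        healthy = sum(state in counted for state in states.values())
        if healthy < min_healthy:
            return seconds
    return None


print(first_quorum_loss(TIMELINE))                                 # 120
print(first_quorum_loss(TIMELINE, degraded_counts_as="critical"))  # 90
print(first_quorum_loss(TIMELINE, min_healthy=2,
                        degraded_counts_as="critical"))            # 0
```

Counting degraded members as critical moves the shutdown from 120s to 90s. Shutting down at the first degraded member (0s) additionally requires min_healthy: 2.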

Sizing warning¶

Redundancy only works if the remaining feed can carry the load. For A+B power, verify that each UPS can carry the full protected load during single-feed operation.

Good targets:

Normal operation: each UPS at or below about 50% load.
Single-feed degraded operation: surviving UPS below about 80% load.
Extra headroom for inrush, battery age, and generator transitions.

If the surviving UPS overloads, software cannot save the rack. Fix the load or UPS sizing.
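
A quick arithmetic check makes the targets concrete. The sketch assumes two UPSes of equal capacity and a load that moves entirely to the survivor when one feed drops. Both function names are hypothetical.

```python
def single_feed_load(load_a_pct, load_b_pct):
    """Load on the surviving UPS after the other feed drops, in percent.

    Assumes equal UPS capacity and that the full load transfers.
    """
    return load_a_pct + load_b_pct


def max_normal_load(single_feed_target_pct=80.0):
    """Highest per-UPS load in normal operation that keeps the survivor
    at the single-feed target, assuming an even A/B split."""
    return single_feed_target_pct / 2


print(single_feed_load(50, 50))  # 100
print(max_normal_load())         # 40.0
```

Under these assumptions, two UPSes at 50% each leave the survivor at 100% with no headroom, so meeting the 80% single-feed target means running each UPS at about 40% or less in normal operation.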

Operational notes¶

Re-entry guard / flag-file lifecycle¶

Each redundancy group has a flag at /var/run/ups-shutdown-redundancy-{group} that prevents a single quorum-loss event from firing the shutdown sequence twice. From 5.3.0 onward the daemon owns this flag's lifecycle:

Cleared at coordinator startup so a stale flag from a prior daemon instance can't silently block the next quorum loss. This is the load-bearing guarantee; the next two clears are optimizations. From 5.3.0-rc4 onward the flag records the owning PID plus Linux process start identity when available, and startup refuses to clear it if that same process is still running. That prevents two overlapping daemon instances from erasing each other's in-flight guard without confusing a stale flag with a reused PID.
Cleared on quorum recovery so the next quorum loss can fire its own shutdown without waiting for a daemon restart. Look for quorum restored -- re-armed for next event in the log.
Cleared on graceful exit via SIGINT / SIGTERM so the next start is clean. A non-graceful exit (crash, SIGKILL, OOM) leaves the flag on disk, but the next coordinator startup re-clears it.

The flag's only role is in-flight re-entry protection within a single quorum-loss event. It is never meant to carry over across runs or across events: a flag file left on disk by a crash is cleared at the next startup. If the flag cannot be inspected or removed at startup, Eneru treats that as fatal and exits rather than starting with an unreliable shutdown guard.

If the daemon ever sees the flag at first call (someone touched it manually, /var/run is read-only, the startup-cleanup hook was bypassed) it logs the exact line below — operators can grep their journal for it verbatim:

⚠️  Redundancy shutdown for '{group}' suppressed: flag /var/run/ups-shutdown-redundancy-{group} already present at first call (startup cleanup bypassed). Will re-arm when quorum recovers.

Pre-5.3.0 the suppression was silent and cost some operators (issue #4) hours of debugging. If you see the warning, check for stuck SSH sessions, stale lock files, or a previous instance that didn't exit cleanly.
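
The startup decision described above (clear a stale flag, but refuse when the recorded owner is still running) can be sketched as a pure function. Everything here is hypothetical: the function name, the return values, and the injected `current_start_of` lookup are illustrative. The handling of a flag with no recorded start identity is an assumption, not documented behavior.

```python
from typing import Callable, Optional


def startup_flag_action(
    recorded_pid: int,
    recorded_start: Optional[str],
    current_start_of: Callable[[int], Optional[str]],
) -> str:
    """Return "clear" for a stale flag, "refuse" if its owner is still alive.

    current_start_of(pid) returns the start identity of a live process,
    or None when no process with that PID exists.
    """
    live_start = current_start_of(recorded_pid)
    if live_start is None:
        return "clear"  # Owner is gone, so the flag is stale.
    if recorded_start is None:
        # Assumption: without a recorded start identity, a live PID is
        # conservatively treated as the owner.
        return "refuse"
    # Same PID but a different start identity means the PID was reused.
    return "refuse" if live_start == recorded_start else "clear"


print(startup_flag_action(4242, "boot1:1000", lambda pid: "boot1:1000"))  # refuse
print(startup_flag_action(4242, "boot1:1000", lambda pid: "boot1:9999"))  # clear
```

Comparing the start identity as well as the PID is what distinguishes a still-running owner from an unrelated process that happens to have been given the same PID.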

Troubleshooting¶

Symptom	Likely cause
One UPS failed but nothing shut down	Quorum still holds. Check `min_healthy` and member states
On-battery member still counts healthy	`degraded_counts_as: healthy` is the default
Advisory log appears but no shutdown	Another member still satisfies quorum
Group never starts	An entry in `ups_sources` does not exactly match a top-level `ups[].name`. Run `eneru validate` to confirm
Repeated tests do not fire	Pre-5.3.0 only: stale `/var/run/ups-shutdown-redundancy-*` from a prior run. 5.3.0+ clears these automatically at startup
`⚠️ Redundancy shutdown … suppressed` log line	The startup-cleanup contract was bypassed — see Operational notes above