Skip to main content

6. Sound Modeling of Async Error Recovery

1. Introduction: The Partial Failure Problem

In a distributed system like the Internet Computer (ICP), partial failure is a first-class reality. When Canister A calls Canister B:

  1. Canister B might trap (runtime error).
  2. The network might time out.
  3. The subnet might be under heavy load, delaying the response indefinitely.
  4. Canister B might be upgraded or deleted while the message is in flight.

For a DeFi protocol, a failed async call is often as dangerous as a successful one. If a "Withdraw" call fails after the user's balance was already decremented in the caller's state, the funds are lost. If the balance wasn't decremented, but the external call actually succeeded despite a network timeout, the funds are double-spent.

Sector9 provides Sound Modeling of Async Error Recovery so protocols can specify and verify their failure paths. It enforces a strict "Try-Catch-Everything" policy and models rollback/commit boundaries conservatively. This article explains the implementation through src/viper/lower_stmt_async.ml and src/viper/try_ctx.ml.

2. Mandatory try/catch Enforcement

Sector9's most visible safety rule is that every await, and every await* that may actually suspend, must appear inside a catch-all try/catch block.

In the Sector9 frontend, a bare await some_call() is rejected by the verifier. The developer must write:

let result = try {
await some_call()
} catch (_) {
fallback_value
};

An await* over an async* computation is treated like a local call when the callee cannot reach a real await; in that non-suspending case it does not introduce an interleaving point.

Why Enforcement?

By forcing the developer to write a catch block, Sector9 forces an explicit failure path. In the Viper lane, the verifier proves that the Actor Invariant holds at the end of the catch block. This means the local proof cannot ignore failed remote calls or assume success-path postconditions on the fallback path.

3. Commit Points and Rollback Domains

To reason about what happens when an error occurs, Sector9 uses the concept of Commit Points.

Static Commit Points

When an await occurs, the current execution path is suspended. The state that has been committed before the suspension becomes the baseline for later success or failure reasoning. In the TLA+ lane, the emitted model records checkpoint snapshots for async correlations and rollback domains. In the Viper lane, the same boundary is modeled through await havoc plus permission release/restore (Subject 1).

Rollback Logic

If an exception is thrown (either locally or as a result of a remote trap), the control flow jumps to the nearest handler in the try_ctx.ml.

The verifier must then model the state of the actor's heap at the start of the catch block. Sector9 implements Rollback Domains:

  1. Local Variables: Generally remain at their last assigned value before the exception.
  2. Actor State: The verifier keeps only the facts justified by the catch path, actor invariants, and the relevant rollback/checkpoint model.
  3. Conservative Havoc: To remain sound, Viper invalidates state facts that may have been affected by the try body or by an interleaving await boundary. This forces the developer to explicitly restore or check these fields in the catch block if they want to reason about them.

4. Async Channel Bookkeeping

A unique feature of Sector9's async modeling is the use of Pending Counters on "Sites."

In lower_stmt_async.ml, every remote call is associated with a "Site Field":

let incs, decs =
match pending with
| Some info ->
let site_fld = (self_e, info.Ir_types.site_field) in
let chan_fld = (self_e, info.Ir_types.chan.Ir_types.field_id) in
( [inc_field_stmt site_fld; inc_field_stmt chan_fld],
[dec_field_stmt site_fld; dec_field_stmt chan_fld] )

Protocol-Level Accounting

This enables the verifier to track "In-Flight Value." For example, a ledger canister can have an invariant: invariant balance_sum + pending_transfers_sum == total_supply

  • When a transfer starts, the pending_transfers counter is incremented.
  • If the await succeeds, the counter is decremented, and the balance is permanently updated.
  • Crucially: If the await fails (enters the catch block), the counter is still decremented (in the generated lowering), and the developer must explicitly "refund" the balance or record the failure.

Because the verifier sees these counter updates, it can prove scoped accounting properties such as conservation of represented value across success and fallback paths.

5. Remote Stubs and Success-Witnesses

When a remote call returns, it provides a Success Witness (RespOk) or an Error Witness (RespErr).

In the try/catch model, Sector9 uses Remote Stubs to represent the external call. The stub defines two potential outcomes:

  1. Success Path: The ensures clauses of the remote method are added to the local context.
  2. Error Path: The ensures clauses are not added, and the verifier only knows that the call failed.

This forces the caller to be "Defensive." You cannot prove a property that depends on the remote call succeeding while you are inside the catch block. This eliminates a major class of "Optimistic Accounting" bugs where developers assume a call succeeded even when it failed.

6. Comparison & Evaluation

vs. Solidity's revert

In Solidity, if an external call fails (and isn't handled via address.call), the entire transaction reverts. This provides "All-or-Nothing" atomicity but limits composability and is impossible in the multi-canister async world of ICP.

  • Solidity: Automatic rollback of all state.
  • Sector9: Manual rollback in the catch block. Sector9 provides the Formal Tools to prove the manual rollback is correct.

vs. Move Prover's aborts_if

Move functions often have aborts_if specs that define exactly when a function will fail.

  • Move: The prover ensures the function only aborts when the spec says so.
  • Sector9: Acknowledges that on ICP, an await can fail for reasons outside the code's control (e.g., network timeout). Thus, Sector9 focuses on Recovery Soundness rather than Abort Prevention.

vs. Erlang's "Let it Crash"

Erlang encourages processes to crash and be restarted by supervisors.

  • Erlang: Focuses on availability through restarts.
  • Sector9: Focuses on Consistency. Crashing is acceptable, but the state left behind must always satisfy the actor's safety invariants.

7. Conclusion

Sector9’s model of async error recovery is "Pessimistic but Provable." By enforcing mandatory try/catch and providing precise channel-based accounting, it turns network failure into an explicit state transition that can be verified against written invariants. For high-stakes DeFi, this is the difference between hoping failure paths work and proving the modeled accounting properties on both success and fallback paths.

Strengths:

  1. Safety First: Mandatory catch makes async error paths explicit in verified code.
  2. Accounting Integrity: Pending counters allow for proving "Conservation of Value" even across failures.
  3. Soundness: Conservative havoc in rollback domains prevents unsound assumptions about state after a crash.

Weaknesses:

  1. Verbosity: Forcing try/catch around every await significantly increases code size.
  2. Manual Rollback: Developers must manually write the rollback logic. (Future versions might automate some "Undo" patterns).