Frosty: Bringing Strong Liveness Guarantees to the Snow Family of Consensus Protocols
Summary. Snowman is the consensus protocol run by blockchains on Avalanche and is part of the Snow family of protocols. A major advantage of Snowman is its low communication overhead: each consensus decision requires only an expected constant communication overhead per validator in the “common” case that the protocol is not under substantial attack. This low overhead is the key property that would enable a consensus protocol to scale to n = 10000 or more independent validators. On the other hand, the following two concerns have remained:
- Providing formal proofs of consistency for Snowman has presented a serious challenge.
- Liveness attacks exist in the case that a Byzantine adversary controls a sufficiently large proportion (more than O(sqrt(n))) of the validators, causing termination to require more than a logarithmic number of rounds.
In this post, we briefly discuss a recent paper that addresses the two issues above by providing a formal consistency proof for Snowman (showing that finalized transactions will not be reverted) and then describing how to augment Snowman with a “liveness module” that can be triggered in the case that a substantial adversary launches a liveness attack.
Background. Since Bitcoin was announced to the world in 2008, there has been a race to develop consensus protocols that work efficiently ‘at scale’. In concrete terms, this means looking to minimize the latency and communication complexity per consensus decision as a function of the number of validators (sometimes called ‘miners’ in the context of proof-of-work protocols). Unfortunately, a classic result of Dolev and Reischuk from 1985 presents a fundamental barrier in this regard. The Dolev-Reischuk bound asserts that deterministic protocols that can tolerate a Byzantine (i.e. arbitrary) adversary of size O(n) must necessarily suffer a quadratic blow-up in communication cost as the size of the network grows. It is precisely this relationship that makes these protocols susceptible to considerable slowdown when the number of validators is large.
Probabilistic sortition. One approach to dealing with this quadratic blow-up in communication cost, employed by protocols such as Algorand, is to use probabilistic sortition. Rather than have all validators participate in every consensus decision, the basic idea is to sample a committee large enough that the proportion of adversarial committee members is almost certainly close to the proportion of adversarial validators in the whole population. Committees of bounded size can then be used to implement consensus, thereby limiting the communication cost. In practical terms, however, avoiding adversarial control of committees requires each committee to be large enough that the quadratic communication cost within the committee is already substantial: Algorand, for example, requires committees with k members, where k is of the order of one thousand, meaning that k^2 is already large.
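To see why committees need to be fairly large, one can compute the tail of the hypergeometric distribution directly. The sketch below (the function name and all parameter values are illustrative, not taken from any protocol specification) estimates the probability that a uniformly sampled committee contains an adversarial share well above the adversary's share of the whole population:

```python
from math import comb

def p_adversarial_share(n, f, k, threshold):
    """Probability that a committee of size k, sampled uniformly without
    replacement from n validators of which f are adversarial, contains
    at least `threshold` adversarial members (hypergeometric tail)."""
    total = comb(n, k)
    return sum(
        comb(f, a) * comb(n - f, k - a)
        for a in range(threshold, k + 1)
    ) / total

# Example: n = 10000 validators, 20% adversarial. How likely is the
# adversary to reach a 1/3 share of the committee?
print(p_adversarial_share(10_000, 2_000, 30, 10))    # small committee
print(p_adversarial_share(10_000, 2_000, 300, 100))  # larger committee
```

Growing the committee drives this tail probability down rapidly, which is exactly why committees on the order of a thousand members are used in practice.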
The Snow family of consensus protocols. The original Avalanche whitepaper described a family of consensus protocols that provide an alternative approach to limiting communication costs. These protocols are all based on a common approach that is best described by considering a binary decision game. For the sake of simplicity, consider the Snowflake protocol, which uses three parameters: k, alpha > k/2, and beta (the whitepaper also describes other variants, such as the Slush and Snowball protocols). For the sake of concreteness, think of k as being a small number, less than 100. Suppose that each validator begins with an initial color, either red or blue. Each validator, p say, then proceeds in rounds. In each round, p randomly samples k validators from the total population and asks those validators to report their present color. If at least alpha of the reported values are the opposite of p’s present color, then p adopts that opposite color. If p sees beta consecutive rounds in which at least alpha of the reported values are red, then p decides red (and similarly for blue).
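The per-validator loop just described can be sketched as follows. This is a simplified, synchronous rendering: `sample_peers` is a stand-in for the network query and, since alpha > k/2, at most one color can reach an alpha-majority in any round.

```python
def snowflake_decide(my_color, sample_peers, k, alpha, beta):
    """One validator's Snowflake loop for a binary (red/blue) decision.
    `sample_peers(k)` is assumed to return the current colors of k
    validators sampled uniformly at random from the population."""
    consecutive = 0  # consecutive alpha-majority rounds for my_color
    while True:
        votes = sample_peers(k)
        for color in ("red", "blue"):
            if votes.count(color) >= alpha:
                if color != my_color:
                    my_color = color
                    consecutive = 1  # a color flip starts a new streak
                else:
                    consecutive += 1
                if consecutive >= beta:
                    return my_color  # beta consecutive rounds: decide
                break
        else:
            consecutive = 0  # no alpha-majority this round; streak resets
```

For instance, a validator that starts blue but keeps sampling all-red responses flips to red in the first round and decides red after beta consecutive alpha-majorities for red.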
The outcome of this dynamic sampling process can be informally described as follows when the adversary is sufficiently bounded. Once the proportion of the population who are red, say, passes a certain tipping point, it holds with high probability that the remainder of the (non-adversarial) population will quickly become red (and symmetrically so for blue). If beta is set appropriately, then the chance that any honest (non-adversarial) validator decides on red before this tipping point is reached can be made negligible, meaning that once any honest validator decides on red (or blue), they can be sure that all other honest validators will quickly decide the same way. The chance that honest validators decide differently can thus be made negligible through an appropriate choice of parameter values. If honest validators begin heavily weighted in favor of one color, then convergence on a decision value will happen very quickly, while variance in random sampling is required to tip the population in one direction in the case that initial inputs are evenly distributed.
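The tipping-point behavior is easy to observe in a toy simulation with honest validators only (all names and parameter values below are illustrative). Starting from a population that is weighted toward red, the sampling dynamic collapses the population to unanimity:

```python
import random

def simulate(n=1000, red0=600, k=20, alpha=11, rounds=500, seed=0):
    """Toy simulation of the whole (honest-only) population under the
    Snowflake-style sampling dynamic. Returns the red count per round;
    the population is expected to collapse to all-red or all-blue."""
    rng = random.Random(seed)
    colors = ["red"] * red0 + ["blue"] * (n - red0)
    history = [red0]
    for _ in range(rounds):
        snapshot = list(colors)  # all validators sample the same snapshot
        for i in range(n):
            votes = [snapshot[rng.randrange(n)] for _ in range(k)]
            if votes.count("red") >= alpha:
                colors[i] = "red"
            elif votes.count("blue") >= alpha:
                colors[i] = "blue"
        history.append(colors.count("red"))
        if history[-1] in (0, n):
            break  # unanimity reached
    return history
```

With the biased 60/40 start, the drift toward the majority color dominates and convergence is fast; started from an even 50/50 split, the same dynamic relies on sampling variance to tip the population one way before the drift takes over, so convergence takes longer.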
While the discussion above considers a single binary decision game, the “Snowman” protocol shows that similar techniques can be used to efficiently solve State Machine Replication (SMR), i.e. to build a full blockchain that continuously finalizes transactions. The transition from simple consensus (Byzantine Agreement) to an efficient SMR protocol is non-trivial and is described in detail in the paper. A major benefit of the approach is that it avoids the need for all-to-all communication: k can be chosen independently of the number of validators n while still ensuring only a small probability of consistency failure (that is, except with small probability, honest processors finalize the same transactions), and each round requires each validator to collect reported values from only k others.
The new results in the paper. The Snowman protocol is presently used by the Avalanche blockchain to implement SMR. However, the following two concerns have remained:
- Providing formal proofs of consistency for Snowman has presented a serious challenge.
- Liveness attacks exist in the case that a Byzantine adversary controls a sufficiently large proportion (more than O(sqrt(n))) of the validators, causing termination to require more than a logarithmic number of rounds.
The new paper addresses the two issues above. With respect to the first issue, the paper:
- Describes a variant of Snowflake, called Snowflake+.
- Gives a complete specification of a version of Snowman that builds on Snowflake+.
- For appropriate parameter values, gives a simple proof of consistency for the resulting Snowman protocol.
- Describes a variant of Snowflake+, called Error-driven Snowflake+, which can be used to give very low latency in the “common case” that there is no substantial attack.
With regard to the second issue, it is worth noting that malicious liveness attacks on Avalanche have not been observed to date. It is certainly desirable, however, to have strong guarantees in the case that a large adversary launches an attack on liveness. The approach taken in the paper is therefore to strike a practical balance. More specifically, the aim is to specify a protocol that is optimized to work efficiently in the “common case” that there is no substantial Byzantine attacker, but which also provides a “fallback” mechanism in the worst case of a substantial attack on liveness.
With that aim in mind, the paper describes how to supplement Snowman with a “liveness module”. The basic idea is to use Snowman to reach fast consensus under normal operation, and to trigger an “epoch change” that temporarily implements a standard quorum-based protocol in the case that a substantial adversary attacks liveness. In that (presumably rare) event, liveness is ensured by temporarily forgoing the communication advantages that Snowman enjoys during normal operation. The difficulty in implementing such a module is to ensure that interactions between the different modes of operation do not impact consistency. The paper gives a formal proof that the resulting protocol, called Frosty, is consistent and live, except with small error probability.
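At a very high level, the mode-switching idea might be structured as below. This is a hypothetical sketch only: the class names, the trigger condition, and the notion of a “stalled round” are illustrative assumptions, not details taken from the paper.

```python
from enum import Enum, auto

class Mode(Enum):
    SNOWMAN = auto()   # common case: low-overhead sampling consensus
    FALLBACK = auto()  # epoch change: quorum-based protocol for liveness

class LivenessModule:
    """Illustrative mode-switching shell: track apparent progress and
    trigger an epoch change when finalization appears to have stalled."""

    def __init__(self, stall_threshold):
        self.mode = Mode.SNOWMAN
        self.stall_threshold = stall_threshold
        self.rounds_without_finalization = 0

    def on_round(self, finalized_something: bool):
        if finalized_something:
            self.rounds_without_finalization = 0
        else:
            self.rounds_without_finalization += 1
            # A suspected liveness attack triggers an epoch change.
            if self.rounds_without_finalization >= self.stall_threshold:
                self.mode = Mode.FALLBACK

    def on_epoch_complete(self):
        # Once the quorum-based epoch has restored progress, return to
        # the efficient common-case protocol.
        self.mode = Mode.SNOWMAN
        self.rounds_without_finalization = 0
```

The hard part, as the paper emphasizes, is not this switching shell but proving that the hand-offs between the two modes cannot cause honest validators to finalize conflicting transactions.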
The full paper can be found at https://arxiv.org/abs/2404.14250 and is a collaboration between Andrew Lewis-Pye (London School of Economics) and Aaron Buchwald, Stephen Buttolph, Patrick O’Grady and Kevin Sekniqi (Ava Labs).