# Primer on data availability checks

This post is an attempt to explain data availability checks, and why they are needed in scaling solutions for blockchains, such as Ethereum 2.0. I assume a basic background in blockchains (Bitcoin, Ethereum) and ideally some background in the consensus algorithms that are used (Proof of Work and Proof of Stake). For simplicity, I will be describing this for a Proof of Stake chain in which all full nodes run the consensus protocol with equal weights and a $2/3$ honesty assumption; but it applies the same to Proof of Work and other protocols.

## Setting the scene

Think of a blockchain that has full nodes and light clients, as well as a peer-to-peer network that may be lossy but does not adaptively censor data. Light clients are a cheaper alternative to full nodes. In traditional blockchain protocols, we assume all clients are running full nodes that verify every transaction in a state transition. Running a full node requires a machine with large amounts of memory, computational power and bandwidth. This cost can be too high for mobile clients and many resource-constrained environments.

A light client is a node that only downloads the header from each block and trusts the full nodes to check that the state transitions are correct – and assumes that the consensus algorithm would not produce a chain that violates this. Light clients rely on full nodes to provide the information inside the blocks for any relevant transaction. This is likely to be a very small percentage of the total data on chain.

So for the purposes of this introduction, we will have two categories of actors:

Full nodes produce a chain via consensus on each block and always download all data as well as verifying the state. Whenever they see a block with an inconsistent state (i.e. the final state of the block is not consistent with the transactions in that block), they produce a fraud proof to alert the light clients of this.

Light clients only download block headers (not transaction data and state), except for the transactions and parts of the state they are interested in. They are connected to full nodes to request the data they need.

A full node has the following security guarantees:

• A consensus (super-)majority of the other full nodes can create an alternative chain, and thus perform a double spending attack; more generally, they can create alternative versions of history with transactions arbitrarily reordered
• Since it is checking the state, even a supermajority of the other full nodes agreeing on an inconsistent state can never get an honest full node to agree to this chain.

So for a full node the security assumption is that $2/3$ of full nodes are honest to guarantee that transactions will not be reordered, but no honesty assumption whatsoever is needed to ensure correct state execution (a full node simply can never be tricked into accepting an incorrect state transition).

For light clients, things are slightly different, because they don’t download and verify the state. The naïve light client without fraud proofs (which we’ll come to below) can thus be tricked into believing that a chain is good by a supermajority ($2/3$) of the full nodes, even if it actually has an incorrect state transition.

## Fraud proofs

Fraud proofs are a way to give the light client a better security model, which is closer to that of the full node. The aim is that the light clients will also be protected from invalid chains, as long as there is at least one honest full node (a much weaker assumption than a $2/3$ majority).

How do fraud proofs achieve this? Let’s assume that the blockchain executes $n$ transactions $t_1, \ldots, t_n$ in the block $B$ with block header $H$. If we add an execution trail that stores the Merkle root of the state before and after each transaction, let’s call these roots $s_0, \ldots, s_n$, then a fraud proof can be constructed if any of the transactions was executed incorrectly (i.e. its effect was not correctly applied to the state): If say $t_i$ is the faulty transaction, then giving the triplet $(s_{i-1}, t_i, s_i)$, together with Merkle proofs showing their inclusion in $H$, will constitute a fraud proof. In fact, we only need to include the parts of $s_{i-1}$ and $s_i$ that $t_i$ needs or affects. This fraud proof is much smaller than the original block $B$ and can thus be easily broadcast in the network to warn light clients not to follow this chain.
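As a toy illustration, here is what such a check might look like in Python. Everything here is a hypothetical stand-in: `apply_transaction`, the balance-map state, and the simplified Merkle root are placeholders for the real protocol's state transition and commitment scheme.

```python
import hashlib

def merkle_root(leaves):
    """Toy Merkle root over a list of byte strings."""
    nodes = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(nodes) > 1:
        if len(nodes) % 2:  # duplicate the last node on odd levels
            nodes.append(nodes[-1])
        nodes = [hashlib.sha256(a + b).digest()
                 for a, b in zip(nodes[::2], nodes[1::2])]
    return nodes[0]

def apply_transaction(state, tx):
    """Hypothetical state transition: `state` maps accounts to balances."""
    sender, receiver, amount = tx
    new_state = dict(state)
    new_state[sender] = new_state.get(sender, 0) - amount
    new_state[receiver] = new_state.get(receiver, 0) + amount
    return new_state

def state_root(state):
    """Commit to a state by Merkleizing its sorted entries."""
    return merkle_root(sorted(f"{k}:{v}".encode() for k, v in state.items()))

def check_fraud_proof(pre_state, tx, claimed_post_root):
    """The proof is valid iff re-executing tx yields a different post-root."""
    return state_root(apply_transaction(pre_state, tx)) != claimed_post_root
```

In the real protocol the light client would of course receive only the touched parts of $s_{i-1}$ and $s_i$ together with Merkle proofs against $H$, not the full state as in this toy version.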

So now we have a security assumption for light clients that’s much better than previously:

• A dishonest $2/3$ supermajority of full nodes can create an alternative chain, and thus change history/reorder transactions (enabling, for example, a double spending attack)
• But in order to prevent an incorrect state transition, the assumption is now that there is at least one honest full node (which will create a fraud proof) and that the network is synchronous (so that you receive that fraud proof in time)

## Data availability problem

There is one gap in the solution of using fraud proofs to protect light clients from incorrect state transitions: What if a supermajority of the full nodes has signed a block header, but will not publish some of the data (in particular, it could be fraudulent transactions that they will publish later to trick someone into accepting printed/stolen money)? The honest full nodes, obviously, will not follow this chain, as they can’t download the data. But the light clients will not know that the data is not available since they don’t try to download the data, only the header. So we are in a situation where the honest full nodes know that something fishy is going on, but they have no means of alerting the light clients, as they are missing the piece of data that might be needed to create a fraud proof.

Couldn’t they just alert the light clients with another kind of message and tell them “hey, watch out, the data in this block is not available”? Yes, but the problem is that they can’t prove it: There is no proof of data being unavailable, so the simple fraud proof mechanism from above does not work.

What’s even worse is that it’s not an attributable fault. It is possible that some data gets lost due to bad networking conditions, and that data might reappear later. So if you are an honest node that sees a data unavailability alarm, and when you check you see that the data is actually there, you cannot know for sure who is at fault: It may be that the producer did not upload the data in the beginning, but only after the alert was created (producer fault), or it could be that it was a false alarm.

Because it’s not an attributable fault, we cannot punish either the producer or the challenger as a consequence of the alarm. That’s annoying because it basically means adding this functionality creates a DoS vector (Vitalik wrote down a very nice description of this problem here).

## The solution: data availability checks using erasure codes

The solution to this conundrum is to make sure that light clients can know whether the data is actually available. Because if they know that the data is available, they know there will likely be an honest full node who has seen and checked it – and broadcast a fraud proof if it’s incorrect/fraudulent.

Of course we don’t want the light clients to have to download the full chain and state to do this – then they wouldn’t be light clients anymore. So instead we will let them download random chunks of data and check that they are available. If you try downloading 100 different chunks of data, and you get all of them, you can be pretty sure most of the data is available (e.g., if at most $50\%$ were available, the probability of you successfully downloading all 100 chunks would be at most $2^{-100}\approx 10^{-30}$, an incredibly small number).
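This back-of-the-envelope number is easy to verify:

```python
# Chance that all k independent random chunk downloads succeed
# when only half of the chunks are actually available.
k = 100
print(0.5 ** k)  # about 7.9e-31
```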

However, it only proves that most data is available – let’s say a mere 100 bytes in a 10 megabyte block of data are missing: The chances that you will request something from exactly that bit of data are slim. And one hundred bytes are enough to hide an evil transaction from the honest fraud provers.
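A quick calculation shows just how slim those chances are. Assuming (purely for the sake of this sketch) one-byte chunks sampled uniformly at random, 100 samples against 100 missing bytes in a 10 MiB block catch the gap only about one time in a thousand:

```python
missing = 100        # bytes withheld by the block producer
total = 10 * 2**20   # 10 MiB block
k = 100              # random samples
p_detect = 1 - (1 - missing / total) ** k
print(p_detect)  # about 0.00095 – the gap is almost certainly missed
```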

So we need to do something to the data to make sure that those checks actually ensure that all the data will be available. This can be done using erasure codes. An erasure code replaces the block data $B$ with a larger amount of encoded data $E$, with the property that some fixed fraction $q$ of $E$ (for example $q = 50\%$) is always enough to reconstruct the whole data. So even if some of the data is missing, as long as light clients can make sure a large enough fraction is available, they know that $B$ can be reconstructed.

Now we are ready to define the behaviour of light clients in the presence of data availability checks. For every block header they download, they will now try to download $k$ random chunks of $E$ in order to assess whether the data is actually available. If they can download all of those chunks, then with probability $1-q^k$ there is actually enough data to reconstruct the whole block on the network.
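Turning this around, a light client can choose $k$ so that the failure probability $q^k$ drops below any target. A small sketch (the function name is illustrative):

```python
import math

def samples_needed(q, epsilon):
    """Smallest k with q**k <= epsilon: if less than a fraction q of the
    extended data is available (too little to reconstruct), the chance
    that all k random downloads succeed is at most epsilon."""
    return math.ceil(math.log(epsilon) / math.log(q))

print(samples_needed(0.5, 1e-9))  # 30 samples for a one-in-a-billion failure rate
```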

Using this mechanism, there is no need for full nodes to “alert” the light clients of data not being available. By just downloading a small amount of data, they can test it for themselves and know.

## Example for erasure codes: Reed-Solomon codes

How do we actually construct erasure codes? A simple and well-known example are Reed-Solomon codes. They are based on the simple fact that any polynomial of degree $d$ over a field is uniquely determined by its evaluation at $d+1$ points. For example, if the polynomial is of degree 1 (a line), then knowing the polynomial at two points is sufficient to know the whole polynomial (there is only one line going through two distinct points).

To work with polynomials we have to work over a finite field, as otherwise the coefficients and evaluations could become arbitrarily large. Luckily, there are fields of any size $2^m$ available (the so-called binary fields or Galois fields over $\mathbb{F}_2$), so that we don’t have to work over some prime field $\mathbb{F}_p$ (though we may have to do that for other reasons in certain schemes).

So let’s say that we have $n$ chunks of data $d_0, \ldots, d_{n-1}$ which we want to erasure code. In order to do this with a Reed-Solomon code, we will interpolate a polynomial $\displaystyle f(x) = \sum_{i=0}^{n-1}a_i x^i$ of degree $d=n-1$ that evaluates to $d_i$ at $i$, i.e. $f(0)=d_0$, $f(1)=d_1$, and so on. We know such a polynomial exists, and in fact the Lagrange interpolation polynomials give us an explicit way to construct it (there are more efficient ways, though).
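As a sketch, Lagrange interpolation can be written in a few lines of Python. The prime modulus $P$ here is an arbitrary illustrative choice; as noted above, practical schemes may use binary fields instead.

```python
P = 65537  # illustrative prime modulus

def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial (mod P) of degree len(points)-1
    passing through the given (xi, yi) pairs."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+)
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

# Two points determine a line: through (0, 1) and (1, 3) runs y = 1 + 2x.
print(lagrange_eval([(0, 1), (1, 3)], 2))  # 5
```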

We now “extend” the data by evaluating the polynomial at some more points – for example $n$ more points if we want to set the rate $q=0.5$. Thus $d_n = f(n), d_{n+1} = f(n+1), \ldots, d_{2n-1} = f(2n-1)$. We thus have the property that any $n$ points will be sufficient to reconstruct the polynomial – and if we have the polynomial $f(x)$, we can also easily get our original data by just evaluating it at $0, \ldots, n-1$.
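Putting the pieces together, here is a minimal end-to-end sketch of a rate-$1/2$ Reed-Solomon extension and recovery, over a small prime field (an illustrative choice; real schemes differ in field and algorithmic efficiency):

```python
P = 65537  # illustrative prime modulus

def lagrange_eval(points, x):
    # Evaluate the unique polynomial through `points` at x, working mod P.
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def extend(data):
    # Original chunks sit at x = 0..n-1; add evaluations at x = n..2n-1.
    points = list(enumerate(data))
    n = len(data)
    return data + [lagrange_eval(points, x) for x in range(n, 2 * n)]

def reconstruct(available, n):
    # `available` is ANY n (x, value) pairs of the extended data.
    return [lagrange_eval(available, x) for x in range(n)]

data = [10, 20, 30, 40]
extended = extend(data)                    # 8 chunks, any 4 suffice
survivors = list(enumerate(extended))[4:]  # lose the first half entirely
print(reconstruct(survivors, 4))  # [10, 20, 30, 40]
```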

That’s all! Reed-Solomon codes are nothing more than some polynomial interpolation. This would in fact be the end of the story for data availability, as they are optimal in terms of coding efficiency – except for one small problem: There is another way in which fraud can happen, which is producing an incorrect encoding. And for Reed-Solomon codes, in order to prove that an encoding is incorrect, you have to provide $n+1$ chunks of data: enough to interpolate the degree-$(n-1)$ polynomial through $n$ of them, and then show that the remaining chunk is not on this polynomial. That’s why we’re currently doing lots of research on finding ways to either avoid having to do those incorrect encoding proofs or making them as small as possible.

## Application to sharding

Data availability checks are important for many different blockchain scaling solutions, because they allow providing security to nodes even if those nodes are unable to verify or even download all the data. Since this is a fundamental bottleneck of blockchains (consensus nodes having to download all data), that’s an important requirement for scaling.

For example, in Ethereum 2.0, validators will only fully validate the beacon chain, and the shards will be validated by committees. The point of this construction is to relieve the validators from having to validate everything. However, this means that the validators are actually light clients on most shards (except for the ones they are doing active work on). Thus, data availability checks are needed. In this case, the Ethereum 2.0 validators are “full nodes” and light clients at the same time. The kind of nodes that really download all shards and check them are called supernodes – these are likely only run by organizations or people staking so much that they would validate on all shards. And we definitely don’t want to have to trust that this small minority is honest in order to run Ethereum 2.0.

Hence, it is absolutely essential to have data availability checks and fraud proofs so that normal people can run validators.