# Primer on data availability checks

This post is an attempt to explain data availability checks, and why they are needed in scaling solutions for blockchains, such as Ethereum 2.0. I assume a basic background in blockchains (Bitcoin, Ethereum) and ideally some background in the consensus algorithms that are used (Proof of Work and Proof of Stake). For simplicity, I will be describing this for a Proof of Stake chain in which all full nodes run the consensus protocol with equal weights and a $2/3$ honesty assumption; but it applies the same to Proof of Work and other protocols.

## Setting the scene

Think of a blockchain that has full nodes and light clients, as well as a peer-to-peer network that may be lossy but does not adaptively censor data. Light clients are a cheaper alternative to full nodes. In traditional blockchain protocols, we assume all clients are running full nodes that verify every transaction in a state transition. Running a full node requires a machine with large amounts of memory, computational power and bandwidth. This cost can be too high for mobile clients and many resource-constrained environments.

A light client is a node that only downloads the header from each block and trusts the full nodes to check that the state transitions are correct – and assumes that the consensus algorithm would not produce a chain that violates this. Light clients rely on full nodes to provide the information inside the blocks for any relevant transaction. This is likely to be a very small percentage of the total data on chain.

So for the purposes of this introduction, we will have two categories of actors:

Full nodes produce a chain via consensus on each block and always download all data as well as verifying the state. Whenever they see a block with an inconsistent state (i.e. the final state of the block is not consistent with the transactions in that block), they produce a fraud proof to alert the light clients of this.

Light clients only download block headers (not transaction data and state), except for the transactions and parts of the state they are interested in. They are connected to full nodes to request the data they need.

A full node has the following security guarantees:

• A consensus (super-)majority of the other full nodes can create an alternative chain, and thus perform a double spending attack; more generally, they can create alternative versions of history with transactions arbitrarily reordered
• Since it is checking the state, even a supermajority of the other full nodes agreeing on an inconsistent state can never get an honest full node to agree to this chain.

So for a full node the security assumption is that $2/3$ of full nodes are honest to guarantee that transactions will not be reordered, but no honesty assumption whatsoever is needed to ensure correct state execution (a full node simply can never be tricked into accepting an incorrect state transition).

For light clients, things are slightly different, because they don’t download and verify the state. The naïve light client without fraud proofs (which we’ll come to below) can thus be tricked into believing that a chain is good by a supermajority ($2/3$) of the full nodes, even if it actually has an incorrect state transition.

## Fraud proofs

Fraud proofs are a way to give the light client a better security model, which is closer to that of the full node. The aim is that the light clients will also be protected from invalid chains, as long as there is at least one honest full node (a much weaker assumption than a $2/3$ majority).

How do fraud proofs achieve this? Let’s assume that the blockchain executes $n$ transactions $t_1, \ldots, t_n$ in the block $B$ with block header $H$. If we add an execution trail that stores the Merkle root of the state before and after each transaction, let’s call these roots $s_0, \ldots, s_n$, then a fraud proof can be constructed if any of the transactions was executed incorrectly (i.e. its effect was not correctly applied to the state): If say $t_i$ is the faulty transaction, then giving the triplet $(s_{i-1}, t_i, s_i)$, together with Merkle proofs showing their inclusion in $H$, will constitute a fraud proof. In fact, we only need to include the parts of $s_{i-1}$ and $s_i$ that $t_i$ needs or affects. This fraud proof is much smaller than the original block $B$ and can thus be easily broadcast in the network to warn light clients not to follow this chain.
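As a toy illustration, here is what such a check might look like in Python. Everything here is a hypothetical stand-in: `apply_transaction`, the balance-map state, and the simplified Merkle root are placeholders for the real protocol's state transition and commitment scheme.

```python
import hashlib

def merkle_root(leaves):
    """Toy Merkle root over a list of byte strings."""
    nodes = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(nodes) > 1:
        if len(nodes) % 2:  # duplicate the last node on odd levels
            nodes.append(nodes[-1])
        nodes = [hashlib.sha256(a + b).digest()
                 for a, b in zip(nodes[::2], nodes[1::2])]
    return nodes[0]

def apply_transaction(state, tx):
    """Hypothetical state transition: `state` maps accounts to balances."""
    sender, receiver, amount = tx
    new_state = dict(state)
    new_state[sender] = new_state.get(sender, 0) - amount
    new_state[receiver] = new_state.get(receiver, 0) + amount
    return new_state

def state_root(state):
    """Commit to a state by Merkleizing its sorted entries."""
    return merkle_root(sorted(f"{k}:{v}".encode() for k, v in state.items()))

def check_fraud_proof(pre_state, tx, claimed_post_root):
    """The proof is valid iff re-executing tx yields a different post-root."""
    return state_root(apply_transaction(pre_state, tx)) != claimed_post_root
```

In the real protocol the light client would of course receive only the touched parts of $s_{i-1}$ and $s_i$ together with Merkle proofs against $H$, not the full state as in this toy version.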

So now we have a security assumption for light clients that’s much better than previously:

• A dishonest $2/3$ supermajority of full nodes can create an alternative chain, and thus change history/reorder transactions (enabling, for example, a double spending attack)
• But in order to prevent an incorrect state transition, the assumption is now that there is at least one honest full node (which will create a fraud proof) and that the network is synchronous (so that you receive that fraud proof in time)

## Data availability problem

There is one gap in the solution of using fraud proofs to protect light clients from incorrect state transitions: What if a supermajority of the full nodes has signed a block header, but will not publish some of the data (in particular, it could be fraudulent transactions that they will publish later to trick someone into accepting printed/stolen money)? The honest full nodes, obviously, will not follow this chain, as they can’t download the data. But the light clients will not know that the data is not available since they don’t try to download the data, only the header. So we are in a situation where the honest full nodes know that something fishy is going on, but they have no means of alerting the light clients, as they are missing the piece of data that might be needed to create a fraud proof.

Couldn’t they just alert the light clients with another kind of message and tell them “hey, watch out, the data in this block is not available”? Yes, but the problem is that they can’t prove it: There is no proof of data being unavailable, so the simple fraud proof mechanism from above does not work.

What’s even worse is that it’s not an attributable fault. It is possible that some data gets lost due to bad networking conditions, and that data might reappear later. So if you are an honest node that sees a data unavailability alarm, and when you check you see that the data is actually there, you cannot know for sure who is at fault: It may be that the producer did not upload the data in the beginning, but only after the alert was created (producer fault), or it could be that it was a false alarm.

Because it’s not an attributable fault, we cannot punish either the producer or the challenger as a consequence of the alarm. That’s annoying because it basically means adding this functionality creates a DoS vector (Vitalik wrote down a very nice description of this problem here).

## The solution: data availability checks using erasure codes

The solution to this conundrum is to make sure that light clients can know whether the data is actually available. Because if they know that the data is available, they know there will likely be an honest full node who has seen and checked it – and broadcast a fraud proof if it’s incorrect/fraudulent.

Of course we don’t want the light clients to have to download the full chain and state to do this – then they wouldn’t be light clients anymore. So instead we will let them download random chunks of data and check that they are available. If you try downloading 100 different chunks of data, and you get all of them, you can be pretty sure most of the data is available (e.g., if at most $50\%$ were available, the probability of you successfully downloading all 100 chunks would be at most $2^{-100}\approx 10^{-30}$, an incredibly small number).
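This back-of-the-envelope number is easy to verify:

```python
# Chance that all k independent random chunk downloads succeed
# when only half of the chunks are actually available.
k = 100
print(0.5 ** k)  # about 7.9e-31
```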

However, it only proves that most data is available – let’s say a mere 100 bytes in a 10 megabyte block of data are missing: The chances that you will request something from exactly that bit of data are slim. And one hundred bytes are enough to hide an evil transaction from the honest fraud provers.
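A quick calculation shows just how slim those chances are. Assuming (purely for the sake of this sketch) one-byte chunks sampled uniformly at random, 100 samples against 100 missing bytes in a 10 MiB block catch the gap only about one time in a thousand:

```python
missing = 100        # bytes withheld by the block producer
total = 10 * 2**20   # 10 MiB block
k = 100              # random samples
p_detect = 1 - (1 - missing / total) ** k
print(p_detect)  # about 0.00095 – the gap is almost certainly missed
```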

So we need to do something to the data to make sure that those checks actually ensure that all the data will be available. This can be done using erasure codes. An erasure code replaces the block data $B$ with a larger amount of encoded data $E$, with the property that some fixed fraction $q$ of $E$ (for example $q = 50\%$) is always enough to reconstruct the whole data. So even if some of the data is missing, as long as light clients can make sure a large enough fraction is available, they know that $B$ can be reconstructed.

Now we are ready to define the behaviour of light clients in the presence of data availability checks. For every block header they download, they will now try to download $k$ random chunks of $E$ in order to assess whether the data is actually available. If they can download all of those chunks, then with probability $1-q^k$ there is actually enough data to reconstruct the whole block on the network.
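Turning this around, a light client can choose $k$ so that the failure probability $q^k$ drops below any target. A small sketch (the function name is illustrative):

```python
import math

def samples_needed(q, epsilon):
    """Smallest k with q**k <= epsilon: if less than a fraction q of the
    extended data is available (too little to reconstruct), the chance
    that all k random downloads succeed is at most epsilon."""
    return math.ceil(math.log(epsilon) / math.log(q))

print(samples_needed(0.5, 1e-9))  # 30 samples for a one-in-a-billion failure rate
```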

Using this mechanism, there is no need for full nodes to “alert” the light clients of data not being available. By just downloading a small amount of data, they can test it for themselves and know.

## Example for erasure codes: Reed-Solomon codes

How do we actually construct erasure codes? A simple and well-known example are Reed-Solomon codes. They are based on the simple fact that any polynomial of degree $d$ over a field is uniquely determined by its evaluation at $d+1$ points. For example, if the polynomial is of degree 1 (a line), then knowing the polynomial at two points is sufficient to know the whole polynomial (there is only one line going through two distinct points).

To work with polynomials we have to work over a finite field, as otherwise the coefficients and evaluations could become arbitrarily large. Luckily, there are fields of any size $2^m$ available (the so-called binary fields or Galois fields over $\mathbb{F}_2$), so that we don’t have to work over some prime field $\mathbb{F}_p$ (though we may have to do that for other reasons in certain schemes).

So let’s say that we have $n$ chunks of data $d_0, \ldots, d_{n-1}$ which we want to erasure code. In order to do this with a Reed-Solomon code, we will interpolate a polynomial $\displaystyle f(x) = \sum_{i=0}^{n-1}a_i x^i$ of degree $d=n-1$ that evaluates to $d_i$ at $i$, i.e. $f(0)=d_0$, $f(1)=d_1$, and so on. We know such a polynomial exists, and in fact the Lagrange interpolation polynomials give us an explicit way to construct it (there are more efficient ways, though).
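As a sketch, Lagrange interpolation can be written in a few lines of Python. The prime modulus $P$ here is an arbitrary illustrative choice; as noted above, practical schemes may use binary fields instead.

```python
P = 65537  # illustrative prime modulus

def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial (mod P) of degree len(points)-1
    passing through the given (xi, yi) pairs."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+)
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

# Two points determine a line: through (0, 1) and (1, 3) runs y = 1 + 2x.
print(lagrange_eval([(0, 1), (1, 3)], 2))  # 5
```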

We now “extend” the data by evaluating the polynomial at some more points – for example $n$ more points if we want to set the rate $q=0.5$. Thus $d_n = f(n), d_{n+1} = f(n+1), \ldots, d_{2n-1} = f(2n-1)$. We thus have the property that any $n$ points will be sufficient to reconstruct the polynomial – and if we have the polynomial $f(x)$, we can also easily get our original data by just evaluating it at $0, \ldots, n-1$.
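Putting the pieces together, here is a minimal end-to-end sketch of a rate-$1/2$ Reed-Solomon extension and recovery, over a small prime field (an illustrative choice; real schemes differ in field and algorithmic efficiency):

```python
P = 65537  # illustrative prime modulus

def lagrange_eval(points, x):
    # Evaluate the unique polynomial through `points` at x, working mod P.
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def extend(data):
    # Original chunks sit at x = 0..n-1; add evaluations at x = n..2n-1.
    points = list(enumerate(data))
    n = len(data)
    return data + [lagrange_eval(points, x) for x in range(n, 2 * n)]

def reconstruct(available, n):
    # `available` is ANY n (x, value) pairs of the extended data.
    return [lagrange_eval(available, x) for x in range(n)]

data = [10, 20, 30, 40]
extended = extend(data)                    # 8 chunks, any 4 suffice
survivors = list(enumerate(extended))[4:]  # lose the first half entirely
print(reconstruct(survivors, 4))  # [10, 20, 30, 40]
```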

That’s all! Reed-Solomon codes are nothing more than some polynomial interpolation. This would in fact be the end of the story for data availability, as they are optimal in terms of coding efficiency – except for one small problem: There is another way in which fraud can happen, which is producing an incorrect encoding. And for Reed-Solomon codes, in order to prove that an encoding is incorrect, you have to provide $n+1$ chunks of data: enough to interpolate the degree-$(n-1)$ polynomial through $n$ of them, and then show that the remaining chunk is not on this polynomial. That’s why we’re currently doing lots of research on finding ways to either avoid having to do those incorrect encoding proofs or making them as small as possible.

## Application to sharding

Data availability checks are important for many different blockchain scaling solutions, because they allow providing security to nodes even if those nodes are unable to verify or even download all the data. Since this is a fundamental bottleneck of blockchains (consensus nodes having to download all data), that’s an important requirement for scaling.

For example, in Ethereum 2.0, validators will only fully validate the beacon chain, and the shards will be validated by committees. The point of this construction is to relieve the validators from having to validate everything. However, this means that the validators are actually light clients on most shards (except for the ones they are doing active work on). Thus, data availability checks are needed. In this case, the Ethereum 2.0 validators are “full nodes” and light clients at the same time. The kind of nodes that really download all shards and check them are called supernodes – these are likely only run by organizations or people staking so much that they would validate on all shards. And we definitely don’t want to have to trust that this small minority is honest in order to run Ethereum 2.0.

Hence, it is absolutely essential to have data availability checks and fraud proofs so that normal people can run validators.