Thanks to Vitalik Buterin, Chih-Cheng Liang and Alex Stokes for helpful comments

A proof of custody is a construction that helps against the “lazy validator” problem. A lazy validator is a validator who, instead of doing the work they are supposed to do – for example, ensuring that some data is available (relevant for data sharding) or that some execution was performed correctly (for execution chains) – pretends that they have done it and signs the result, for example an attestation that claims the data is available anyway.

The proof of custody construction is a cryptoeconomic primitive that changes the game theory so that lazy validating simply isn’t an interesting strategy anymore.

Lazy validators – the game theory

Let’s assume there is a well-running Ethereum 2.0 chain (insert your favourite alternative PoS blockchain if you prefer). We don’t usually expect bad things – data being withheld, invalid blocks being produced – to happen. In fact, you are likely never to see them happen: as long as the system is run by a majority of honest validators, attacks of this kind are pretty much guaranteed to fail, so there is no point in even attempting them.

Now assume you run a validator. This comes with different kinds of costs – obviously the staking capital, but also hardware, electricity and internet bandwidth, which you might pay for directly (your provider charges you per GB) or indirectly (when your validator is running, your Netflix lags). The lower you can make these costs, the more net profit you make from running your validator.

One of your tasks as a validator in sharded Eth2 is to ensure the availability of shard data. Each attestation committee is assigned one blob of data to check, around 512 kB to 1 MB in size. Each validator has to download it and store it for around 90 days.

But what happens if you simply sign all attestations for shard blobs, without actually downloading the data? You would still get your full rewards, but your costs have suddenly decreased. We are assuming the network is in a good state, so your laziness isn’t going to harm the network immediately. Let’s say your profit from running a validator was $1 per attestation, and the cost of downloading the blob was $0.10 per attestation. Now your profit has increased to $1.10.

|        | Profit per signed attestation |
| ------ | ----------------------------- |
| Honest | $1.00                         |
| Lazy   | $1.10                         |

This problem is called the verifier’s dilemma and was introduced in Demystifying Incentives in the Consensus Computer by Luu et al.

But I would never do this! Who would cheat like that?

It often seems obvious to us that in games like this, surely you would not succumb to bribery and would stick with the honest behaviour. But it’s often more subtle than that.

Let’s assume that after having run a validator for years, a new client comes out that claims to be 10% more cost effective. People run it and see that it works, and it seems to be safe. The way it actually does this is by not downloading the shard blocks.

This could even happen by accident. Someone cut some corners in the development process, everything looks normal – it just doesn’t join the right shard subnet, and nobody noticed, because it causes no faults in normal operation.

Some people will probably run this client.

Something else that could happen is that a service could step in to do the downloading for you. For $0.01 per shard blob, they will download the data, store it for 90 days, and send you a message that the data is available and you can sign the attestation. How bad is this?

It’s also quite bad. As more and more people start using this service, it becomes a single point of failure. Or even worse, it could be part of an attack: if it can make more than 50% of validators vote for the availability of a shard blob without ever publishing the blob, that would be a withholding attack.

As is often the case, dishonesty can come in many disguises, so our best bet is to work on the equilibrium to make the honest strategy rational.

A proof of custody and an update to the game theory

The proof of custody works like this: imagine we can put a “bomb” in a shard blob. If you sign this blob, you get a large penalty of, say, $3,000 (you get slashed). You definitely don’t want to sign this blob.

Does that make you want to download it? That is certainly one way to avoid signing the bomb. But if anyone can detect the bomb, then someone can simply write a service that warns you before you sign an attestation for one. So the bomb needs to be specific to an individual validator, such that no one else can compute whether a shard blob is a bomb for them.

OK, now we have the essential ingredients for the proof of custody. We need

  1. An ephemeral secret that is recomputed every custody epoch (ca. 90 days), is individual to each validator, and is revealed once it has expired (so that other validators have a chance to check the proof of custody)
  2. A function that takes the whole shard blob data, as well as the ephemeral key, and outputs 0 (not a bomb), or, with very small probability, 1 (this blob is a bomb)

It is essential that the ephemeral secret isn’t made available to anyone else, so there are three slashing conditions:

  1. A validator can get slashed if anyone else can present its current ephemeral secret
  2. The ephemeral secret has to be published after the custody period, and failing to do so also leads to slashing
  3. Signing a bomb leads to slashing

How can we create this function? A simple construction works like this. Compute a Merkle tree of leaves (data0, secret, data1, secret, data2, secret, ...) as illustrated here:

```mermaid
graph TB
    A[Root] --> B[Hash]
    A --> B1[Hash]
    B --> C[Hash]
    B --> C1[Hash]
    C --> D[data0]
    C --> E[secret]
    C1 --> D1[data1]
    C1 --> E1[secret]
    B1 --> C2[Hash]
    B1 --> C3[Hash]
    C2 --> D2[data2]
    C2 --> E2[secret]
    C3 --> D3[data3]
    C3 --> E3[secret]
```

Then take the logical AND of the first 10 bits of the root. This gives you a single bit that is 1 an expected 1 in 1,024 times.

This function cannot be computed without knowing both the secret and the data.
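As a sketch (not the actual Eth2 spec – the chunking, the hash function and the leaf layout are simplifying assumptions here), the custody-bit function can be written as:

```python
import hashlib

def hash_pair(a: bytes, b: bytes) -> bytes:
    return hashlib.sha256(a + b).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Merkle root of a power-of-two number of leaves."""
    layer = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(layer) > 1:
        layer = [hash_pair(layer[i], layer[i + 1]) for i in range(0, len(layer), 2)]
    return layer[0]

def custody_bit(data_chunks: list[bytes], secret: bytes) -> int:
    """Interleave the data chunks with the ephemeral secret as in the tree
    above, then AND the first 10 bits of the root: an expected 1 in 1,024
    blobs is a "bomb" for this particular secret."""
    leaves = []
    for chunk in data_chunks:
        leaves += [chunk, secret]
    root = merkle_root(leaves)
    first_10_bits = int.from_bytes(root[:2], "big") >> 6
    return 1 if first_10_bits == 0b1111111111 else 0  # AND of the 10 bits
```

Changing either the secret or a single chunk of data changes the root completely, so the bit cannot be predicted without knowing both.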

(Because we do want to enable secret shared validators, a lot of work has gone into optimizing this function so that it can be efficiently computed in an MPC, which a Merkle tree cannot be. For this we are suggesting a construction based on a Universal Hash Function and the Legendre symbol: https://ethresear.ch/t/using-the-legendre-symbol-as-a-prf-for-the-proof-of-custody/5169)
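For illustration, a toy version of such a Legendre PRF bit might look like this (the prime `P` and the residue-to-bit mapping are assumptions made for the sketch, not the spec’s parameters):

```python
# Toy Legendre-symbol PRF bit in the spirit of the linked post.
P = 2**127 - 1  # a Mersenne prime, chosen here purely for illustration

def legendre_prf_bit(key: int, x: int, p: int = P) -> int:
    """1 if key + x is a quadratic non-residue mod p, else 0."""
    ls = pow((key + x) % p, (p - 1) // 2, p)  # Euler's criterion
    return 1 if ls == p - 1 else 0
```

Each output bit costs one Legendre-symbol evaluation, and because the Legendre symbol is multiplicative, it is far cheaper to evaluate on secret-shared inputs than a hash-based Merkle tree.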

New game theory

All right, so with the proof of custody, any shard blob has a 1/1,024 chance of being a bomb, and you don’t know which one it is without downloading it.

The lazy validator does just fine when the blob is not a bomb. However, when it is a bomb, we see the big difference: the honest validator simply skips this attestation, which is very minor and simply sets the profit to zero. The lazy validator, however, signs it and gets slashed, making a huge loss. The payoff matrix now looks like this:

|        | Profit for non-bomb attestation | Profit for bomb attestation | Total over 1,024 attestations |
| ------ | ------------------------------- | --------------------------- | ----------------------------- |
| Honest | $1.00                           | $0.00                       | $1,023.00                     |
| Lazy   | $1.10                           | $-3,000.00                  | $-1,874.70                    |

In the last column, we see that the expected profit for the lazy validator is now negative. Since the whole reason for being lazy was to increase profits by lowering costs, lazy validation is no longer an interesting strategy.
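The totals can be checked with a quick back-of-the-envelope computation (all dollar figures are the hypothetical ones used throughout this post):

```python
# Hypothetical numbers from the payoff table above.
honest_reward, lazy_reward = 1.00, 1.10  # profit per non-bomb attestation
slash_penalty = 3000.00                  # loss for signing a bomb
n = 1024                                 # one bomb expected per 1,024 blobs

honest_total = (n - 1) * honest_reward + 0.00       # honest: skip the bomb
lazy_total = (n - 1) * lazy_reward - slash_penalty  # lazy: sign it, get slashed

print(f"honest: ${honest_total:,.2f}, lazy: ${lazy_total:,.2f}")
# honest: $1,023.00, lazy: $-1,874.70
```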

Proof of custody for execution

Another task of validators will be verifying the correct execution of blocks. This means verifying that the new state root that is part of a block is the correct one that results from applying all the transactions. The proof of custody idea can also be applied to this: the validator has to compute the proof of custody in the same way as described above, except that the data is the execution trace. The execution trace is some output generated by the step-by-step execution of the block. It does not have to be complete in any sense; we only want two properties from it:

  1. It should be difficult to guess the execution trace without actually executing the block.
  2. The total size of the execution trace should be large enough that simply distributing it in addition to normal blocks is unattractive.

There are some easy ways of doing this; for example, simply outputting every single instruction byte that the EVM executes would probably result in an execution trace of a few MB per execution block. Another option would be to record the top of the stack at every step.
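As a toy illustration of the second option (a mini stack machine, not the real EVM; the instruction names are made up for the sketch):

```python
# A mini stack machine: the "execution trace" records the top of the
# stack (mod 256) after every executed instruction.
def run_with_trace(program: list) -> tuple[int, bytes]:
    stack, trace = [], bytearray()
    for op, *args in program:
        if op == "PUSH":
            stack.append(args[0])
        elif op == "ADD":
            stack.append(stack.pop() + stack.pop())
        elif op == "MUL":
            stack.append(stack.pop() * stack.pop())
        trace.append(stack[-1] % 256)  # hard to guess without executing
    return stack[-1], bytes(trace)

# (2 + 3) * 4 = 20; the trace is the sequence of intermediate stack tops.
result, trace = run_with_trace(
    [("PUSH", 2), ("PUSH", 3), ("ADD",), ("PUSH", 4), ("MUL",)]
)
```

The resulting `trace` would play the role of the blob data in the custody function: cheap to produce once you have actually executed the block, hard to guess otherwise.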

With fraud proofs, do we still need the proof of custody for execution?

When we upgrade the execution chain to statelessness, which means that blocks can be verified without having the current state, fraud proofs become easy. (Without statelessness, they are hard: Fraud proofs always have to be included on a chain different from the one where the fraud happened, and thus the actual pre-state would not be available when they have to be verified.)

This means that it will be possible to slash a validator who has produced an invalid execution block. Furthermore we can also penalize any validator that has attested to this block. Would that mean that the proof of custody is no longer necessary?

It does certainly shift the balance. But even with this penalty present, lazy validation can still be a rational strategy. It would probably be a bad idea for a validator to simply sign every block without verifying execution, as an attacker only needs to sacrifice a single validator of their own to get you slashed.

However, you can employ the following strategy: On each new block, you wait for some small percentage of other validators to sign it before you sign it yourself. Those who sign it first are unlikely to be lazy validators, as they would be employing the same strategy. This would get you quite good protection in most situations, but at a systemic level it would still leave the chain vulnerable in extreme cases.

The case with fraud proofs is thus improved, but a proof of custody remains superior for ensuring that lazy validation can’t be a rational strategy.

How is it different from data availability checks?

I wrote a primer on data availability checks here. It looks like the proof of custody for shard blobs tries to solve a very similar problem: Ensuring that data that is committed to in shard blob headers is actually available on the network.

So we may wonder: Do we need both a proof of custody and data availability checks?

There is an important difference between the two constructions, though:

  • Data availability checks ensure the availability of the data independently of the honest majority assumption. Even a powerful attacker controlling the entirety of the stake can’t trick full nodes into accepting withheld data as available.
  • In contrast, a proof of custody does not help if the majority of the stake is performing an attack. The majority can compute the proof of custody without ever releasing the data to anyone else.

So in a theoretical sense, data availability checks are strictly superior to the proof of custody for shard data: they hold unconditionally, whereas the latter only serves to keep rational validators honest, making an attack less likely.

Why do we still need a proof of custody for shard blobs? It might not necessarily be needed. There are however some practical problems with data availability checks that make it desirable to have a “first line of defence” against missing data:

Data availability checks work by excluding unavailable blocks from the fork choice rule. However, this exclusion cannot take effect immediately: data availability checks only ensure that eventually everyone will see the same result, not that they see it right away.

This is because publishing a partially available block might result in some nodes seeing it as available (they receive all their samples) and others seeing it as unavailable (they miss some of their samples). Data availability checks ensure that in this situation the data can always be reconstructed. However, this requires some node to first gather enough samples to reconstruct the data and then re-seed them so everyone can see them; this process can take a few slots.

In order to prevent a minority attacker (with less than 1/3 of the stake) from causing such a disruption, we only want to apply data availability checks when the chain is finalized, not immediately. In the meantime, the proof of custody can ensure that an honest majority will only ever build an available chain, where the shard data is already seeded in committees; since the committees are ready to re-seed all samples even if the original blob producer doesn’t, an attacker can’t easily force a partially available block.

In this construction, the proof of custody and data availability checks have two orthogonal functions:

  1. The proof of custody for shard data ensures that an honest majority of validators will only ever build a chain in which all shard data is available and well seeded across committees. A minority attacker cannot easily cause disruption to this.
  2. Data availability checks will guarantee that even if the majority of stake is attacking, they will not be able to get the remaining full nodes to consider a chain with withheld data as finalized.