GRANDPA Equivocation and sysinfo Process Collection Result in Slashing on the Kusama Network: A Post-Mortem
Multiple bugs caused nodes to drop off the Kusama network and lose the database that records which blocks they had validated. As a result, the same nodes double-signed those blocks on restart. The slashes caused by this issue have been reverted via Kusama Council motions.
On Friday, July 31, two Kusama validators on runtime version v2019 started crashing every few minutes with two distinctive errors, and reported the issue. At first glance, the problem seemed to be related to the validators' keys. This was subsequently ruled out, as the affected validators confirmed that they had not changed keys in the process. Additionally, the issue appeared only on the Kusama network, not on Polkadot.
Going a bit further down the rabbit hole, the team realised that the issue had started with a GRANDPA equivocation that caused a slash event on Kusama, and that the equivocation was itself triggered by a file descriptor leak that made nodes crash. The leak prevented nodes from writing the GRANDPA voter state (the votes cast in a given round) to disk. Nodes that lost this data voted again after restarting, this time for a block newer than their original choice. Two different votes from the same validator in the same round constitute an equivocation.
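To make the failure mode concrete, here is a minimal sketch of how an equivocation can be detected: the same voter signing two different blocks in the same round. The `Vote`, `Equivocation` and `RoundTracker` types are hypothetical and heavily simplified for illustration. They are not the actual GRANDPA implementation, which also deals with signatures and separate prevote and precommit stages.

```rust
use std::collections::HashMap;

// Hypothetical, simplified vote: real GRANDPA votes are signed and come in
// prevote/precommit stages, which this sketch leaves out.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Vote {
    voter: String,
    round: u64,
    block_hash: String,
}

// Evidence that one voter backed two different blocks in the same round.
#[derive(Debug, PartialEq, Eq)]
struct Equivocation {
    voter: String,
    round: u64,
    first: String,
    second: String,
}

// Remembers the first block each voter voted for in each round.
#[derive(Default)]
struct RoundTracker {
    seen: HashMap<(String, u64), String>,
}

impl RoundTracker {
    // Returns Some(..) only when the vote conflicts with an earlier vote
    // by the same voter in the same round.
    fn import(&mut self, vote: &Vote) -> Option<Equivocation> {
        let key = (vote.voter.clone(), vote.round);
        if let Some(prev) = self.seen.get(&key) {
            if *prev != vote.block_hash {
                return Some(Equivocation {
                    voter: vote.voter.clone(),
                    round: vote.round,
                    first: prev.clone(),
                    second: vote.block_hash.clone(),
                });
            }
            // Re-importing the identical vote is harmless.
            return None;
        }
        self.seen.insert(key, vote.block_hash.clone());
        None
    }
}

fn main() {
    let mut tracker = RoundTracker::default();
    // Vote cast before the crash.
    let before_crash = Vote { voter: "validator-1".into(), round: 7, block_hash: "0xaa".into() };
    // Vote cast after a restart that lost the persisted voter state.
    let after_restart = Vote { voter: "validator-1".into(), round: 7, block_hash: "0xbb".into() };

    assert!(tracker.import(&before_crash).is_none());
    match tracker.import(&after_restart) {
        Some(eq) => println!("equivocation detected: {:?}", eq),
        None => println!("no equivocation"),
    }
}
```

A node that had persisted its first vote to disk could consult it after restarting and avoid casting the second, conflicting vote.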
The combination of these two events began to result in validators being slashed at some point after v0.8.15 (v2015 in Kusama) was released and the network was upgraded. The Authority Discovery feature had already been in place for some time at the runtime module level but had not been enabled by default on the client, and this version also enabled GRANDPA to report equivocations via unsigned extrinsics.
With this information in hand, the team's main hypothesis was that equivocations caused by the file descriptor leak could have started some time earlier but were only reported after the v0.8.15 upgrade back in July: once they were running this version, nodes started reporting themselves after crashing, which attracted the attention of the teams involved. Still, an investigation of the logs of nodes run by Parity did not find any earlier equivocations (these would have been logged to the terminal).
Further investigation into the root causes of the file descriptor leak pointed at two main culprits: authority discovery and metrics collection. Authority discovery was using an excessive number of sockets to query data from the DHT (i.e. to discover other authorities' IP addresses). For system metrics collection (e.g. CPU and memory), we were relying on the sysinfo crate, which kept a cache of file descriptors for every process in the system and for the threads of each process (it fetches this data by reading from /proc).
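A leak like this can be spotted early by watching how many descriptors a process holds. The sketch below counts the entries in /proc/self/fd using only the Rust standard library. It is Linux-specific and is not part of the Polkadot client. The `near_fd_limit` helper and its 80% threshold are hypothetical, chosen purely for illustration.

```rust
use std::fs;
use std::io;

// Count the file descriptors currently open in this process by listing
// /proc/self/fd (Linux only). The listing itself briefly holds one extra
// descriptor, so the result is slightly higher than the steady-state count.
fn open_fd_count() -> io::Result<usize> {
    Ok(fs::read_dir("/proc/self/fd")?.count())
}

// Hypothetical helper: true once `open` reaches `ratio` of the soft limit.
fn near_fd_limit(open: usize, soft_limit: usize, ratio: f64) -> bool {
    open as f64 >= soft_limit as f64 * ratio
}

fn main() -> io::Result<()> {
    let before = open_fd_count()?;
    // Hold a file open so that the count visibly grows.
    let _file = fs::File::open("/proc/self/status")?;
    let after = open_fd_count()?;
    println!("open descriptors: {} -> {}", before, after);

    // 1024 is a common default soft limit, assumed here for illustration.
    if near_fd_limit(after, 1024, 0.8) {
        eprintln!("warning: close to the file descriptor limit");
    }
    Ok(())
}
```

Sampling this count periodically and alerting on steady growth would flag a leak before the process hits its limit and starts failing to open files or sockets.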
The short-term solution was to disable the Authority Discovery feature by default and to stop collecting system metrics. The Authority Discovery module will be re-enabled in a future release once there is a proper fix for the excessive use of sockets.
Until a new version was available, the Parity team recommended manually disabling Authority Discovery. Additionally, whenever a node crashed, validators were advised to wait 1-2 minutes before restarting it. This reduces the likelihood of the node equivocating in GRANDPA if its votes were not persisted to disk.
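The restart advice can be expressed as a small helper that a process supervisor might call before relaunching a crashed node. This is only a sketch: `MIN_RESTART_DELAY` and `remaining_restart_delay` are hypothetical names, and the 120-second value is simply the upper end of the 1-2 minute recommendation.

```rust
use std::time::Duration;

// Hypothetical policy value: the upper end of the recommended 1-2 minutes.
const MIN_RESTART_DELAY: Duration = Duration::from_secs(120);

// How much longer to wait, given the time already elapsed since the crash.
// Saturates at zero once the minimum delay has passed.
fn remaining_restart_delay(since_crash: Duration) -> Duration {
    MIN_RESTART_DELAY.saturating_sub(since_crash)
}

fn main() {
    let wait = remaining_restart_delay(Duration::from_secs(30));
    println!("wait {}s before restarting the node", wait.as_secs());
    // A supervisor would call std::thread::sleep(wait) here and then
    // relaunch the node process.
}
```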
After some discussion and development, Polkadot v0.8.22 was released, including the short-term fixes detailed above. All validators should upgrade to this version and monitor the results. All slashes caused by this bug were reverted by the Kusama Council. In the same spirit, a new discussion was opened on reverting the economic loss, but not the nomination loss, suffered by validators.
To keep up with developments, there are plenty of ways to get plugged in to the Kusama community. Join the discussion on the Direction Channel. Learn more about Kusama on our website and in the Kusama Wiki. Want to join the core growth team behind Kusama? Join the Ambassador Program.