Cassandra Deletion & Tombstones

September 10, 2019 by Chris Sherman

Cassandra stores data in immutable files on disk, and this data is typically replicated across multiple nodes. To record the fact that a delete occurred, Cassandra writes a special value called a tombstone. The tombstone indicates Cassandra considers the values deleted, and Cassandra will not return the tombstoned data in future queries.

Background

Availability: To ensure availability, Cassandra replicates data to distinct nodes according to a replication factor (RF). The RF defines the number of copies to keep in each datacenter per Cassandra keyspace. When any node goes down, the RF allows Cassandra to maintain availability by reading from other replicas.

Consistency: To ensure consistent reads and writes, Cassandra enforces consistency levels (CLs) and a RF.

CL.READ = Number of nodes required to acknowledge a read before Cassandra considers the query successful.

CL.WRITE = Number of nodes required to acknowledge a write before Cassandra considers the operation successful.

RF = Number of nodes to which data gets copied.

CL.READ + CL.READ > RF

A common configuration for a Cassandra cluster is a replication factor of three with read and write consistency levels each of two.

RF = 3

CL.READ = RF/2 + 1 = 2

CL.WRITE = RF/2 + 1 = 2

CL.READ + CL.WRITE > RF ==> 4 > 3

With the configuration above, Cassandra attains high availability as there is no single point of failure. If a node goes down, a read will successfully fetch the written data on at least one node and apply the Last Write Wins (LWW) algorithm to choose which node is holding the correct data.

The Problem With Deletes in Distributed Databases

Consider a Cassandra cluster of three nodes with a RF of three along with read and write CL each of two. We issue a delete command that succeeds on two-out-of-three nodes. According to the CL, Cassandra considers this delete command successful. However, the next read involving the node on which the delete failed will be ambiguous because there is no way for Cassandra to determine what is correct: return an empty response (from a node successfully marked as deleted) or return the data (from the node on which the delete failed). Cassandra will consider returning the data the correct action, so this delete results in reappearing data and unpredictable behavior.

Tombstones

Tombstones in Cassandra are additional columns stored alongside existing data. A delete command inserts the tombstone, initially as a new file. Tombstones address the problem of deleting data from a system that uses immutable files to store data. By writing a tombstone to a (logically) adjacent file, the LWW algorithm will not choose the node on which delete failed, because the node on which the delete succeeded will have a more recent write.

An unintuitive aspect of tombstones is they actually increase the amount of data on disk in order to achieve the intended purpose of removing data. Scanning tombstone data has costs beyond disk utilization, as the more data we have to consult per row, the slower reads become. To address this issue, Cassandra uses a compaction strategy to consolidate tombstones and, eventually, remove from disk the tombstone and the data we deleted.

Cassandra considers a tombstone eligible for removal when the time at which we issued the delete command exceeds certain thresholds, one of which is gc_grace_seconds. As we previously discussed, a central job of a tombstone is to communicate a delete operation to other nodes in the cluster who may not have successfully received the delete operation. The gc_grace_seconds gives Cassandra time to repair nodes in the cluster suffering from failed operations. Once the current timestamp minus the tombstone timestamp exceeds the gc_grace_seconds threshold, Cassandra will evaluate the deleted data and its tombstone for removal from disk, subject to a few other safety rules.

Monitoring Tombstones

Since tombstones cost us in terms of disk utilization and increased query times, monitoring the amount of tombstoned data is an important part of monitoring the health of a Cassandra cluster. We can use the SSTablemetadata command to get information on a particular table’s tombstone ratio. We can also use the TombstoneScannedHistogram metric to output more detailed information.

Conclusion

Performing deletion in a distributed database necessitates some additional complexity in order to maintain node consistency, one of which is tombstones. While Cassandra has built-in processes for managing tombstoned data, monitoring this metadata generated by delete commands is important for maintaining the health and responsiveness of a Cassandra cluster.

Sources

Rodriguez, Alain (2016, July 27). About Deletes and Tombstones in Cassandra. Retrieved from https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html

Cassandra