A Cassandra data store is made up of a collection of column families, often referred to as tables. Within each table is a collection of columns. These columns consist of a combination of metadata and data.
Column name: 2 bytes (length as short int) + byte[]
Flags: 1 byte
Timestamp: 8 bytes (long)
Value: 4 bytes (length as int) + byte[]
If counter column: 8 bytes (timestamp of last delete)
If expiring column: 4 bytes (TTL) + 4 bytes (local deletion time)
As shown above, Cassandra stores at least 15 bytes of metadata for each column. Counter columns and expiring columns (columns with a time-to-live value set) each require an additional eight bytes of overhead. In addition to metadata, we need space for the name of each column and the value stored within it, shown above as a byte array.
Note that not every column has a value. If the partition key is equal to the value of a column, that column will not duplicate the value of the partition key.
Clustering keys also have empty values. Clustering keys are additional columns used for ordering. Each of these columns sets its name property to the clustering key and leaves the value empty. Columns with empty values consist of 15 bytes of column metadata plus the size of the column name.
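The column sizing rules above can be sketched in a few lines of Python. This is a rough estimate under the figures quoted in the text, not Cassandra's actual serialization code; the function name `column_size` and its parameters are hypothetical.

```python
# Sizes in bytes, taken from the figures quoted in the text.
COLUMN_METADATA_BYTES = 15   # minimum metadata stored for every column
COUNTER_EXTRA_BYTES = 8      # timestamp of last delete
EXPIRING_EXTRA_BYTES = 8     # 4-byte TTL + 4-byte local deletion time


def column_size(name: bytes, value: bytes = b"", *,
                counter: bool = False, expiring: bool = False) -> int:
    """Estimate one column's size: metadata + name bytes + value bytes."""
    size = COLUMN_METADATA_BYTES + len(name) + len(value)
    if counter:
        size += COUNTER_EXTRA_BYTES
    if expiring:
        size += EXPIRING_EXTRA_BYTES
    return size


# A regular column with a 3-byte name and a 4-byte value.
regular = column_size(b"age", b"\x00\x00\x00\x2a")
# A clustering-key style column: the name carries the data, the value is empty.
clustering = column_size(b"2014-01-01")
# An expiring column pays 8 more bytes for the TTL and local deletion time.
expiring = column_size(b"token", b"abc", expiring=True)
```

Under these assumptions the three example columns come to 22, 25, and 31 bytes.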
Sets of columns are organized by partition key. Every partition key requires 23 bytes of metadata. Cassandra uses partition keys to disperse data throughout a cluster of nodes and for data retrieval.
Sets of columns within a table are often referred to as rows. A partition can hold multiple rows when sets share the same partition key. To calculate the size of a row, we need to sum the size of all columns within the row and add that sum to the partition key size.
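As a minimal sketch, the row calculation is a sum over column sizes plus the partition key size. The helper names below are hypothetical, and treating the partition key size as the 23 bytes of metadata plus the bytes of the key itself is an assumption of this example rather than something the text spells out.

```python
from typing import Iterable

PARTITION_KEY_METADATA_BYTES = 23  # per-partition-key metadata quoted in the text


def partition_key_size(key: bytes) -> int:
    """Assumption: key metadata plus the bytes of the key value itself."""
    return PARTITION_KEY_METADATA_BYTES + len(key)


def row_size(column_sizes: Iterable[int], key_size: int) -> int:
    """Sum the sizes of all columns in the row and add the partition key size."""
    return sum(column_sizes) + key_size


# Three columns of 22, 25, and 31 bytes stored under a 6-byte partition key.
example_row = row_size([22, 25, 31], partition_key_size(b"user42"))
```

Here the columns total 78 bytes and the key contributes 29, giving an estimated row size of 107 bytes.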
To calculate the size of a partition, sum the row size for every row in the partition. A shortcut is to average the size of data within a row. With this simplifying assumption, the size of a partition becomes: partition size = average row size × number of rows in the partition.
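The exact sum and the average-row shortcut can be compared side by side. This is only an illustration; the function names and the sample row sizes are made up for the example.

```python
from typing import Iterable


def partition_size_exact(row_sizes: Iterable[int]) -> int:
    """Exact estimate: add up the size of every row in the partition."""
    return sum(row_sizes)


def partition_size_estimate(average_row_size: float, row_count: int) -> float:
    """Shortcut: assume every row is roughly the average size."""
    return average_row_size * row_count


# Hypothetical row sizes, in bytes, for a three-row partition.
rows = [100, 110, 120]
average = sum(rows) / len(rows)

exact = partition_size_exact(rows)
estimate = partition_size_estimate(average, len(rows))
```

When the average is computed from the same rows, the two results agree. In practice the shortcut is useful when only a typical row size and an expected row count are known.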
To calculate the size of a table, we must account for the cluster’s replication factor. Based on the replication factor, Cassandra writes a copy of each partition to other nodes in the cluster, so total table size is the table data size multiplied by the replication factor.
If the replication factor is set to one (data is stored on a single node in the cluster), there is no additional overhead for replication. For clusters with a replication factor greater than one, total table size scales linearly with the replication factor.
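The effect of replication on total size can be sketched as a single multiplication. The function name `table_size` and the 10 GB figure are hypothetical, chosen only to show the linear scaling.

```python
def table_size(table_data_size: int, replication_factor: int) -> int:
    """Total size across the cluster: table data size times replication factor."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return table_data_size * replication_factor


GIB = 1024 ** 3
data_size = 10 * GIB  # hypothetical table holding 10 GiB of data

single_copy = table_size(data_size, 1)   # no replication overhead
three_copies = table_size(data_size, 3)  # three times the data size
```

With a replication factor of three, the same 10 GiB of table data occupies 30 GiB across the cluster.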
Knowing how to calculate the size of a Cassandra table allows you to estimate the effect different data models will have on the size of your Cassandra cluster. Keep in mind that in addition to the size of table data described in this post, there is additional overhead for indexing table data and organizing tables on disk.