Calculate the Size of a Cassandra Table

May 5, 2017 by Christopher Sherman

A Cassandra data store is made up of a collection of Column Families often referred to as tables. Within each table is a collection of columns. These columns consist of a combination of metadata and data.

| Property | Metadata | Value |
| --- | --- | --- |
| Name | 2 bytes (length as short int) | byte[] |
| Flags | 1 byte | |
| Timestamp | 8 bytes (long) | |
| Value | 4 bytes (length as int) | byte[] |
| if Counter Column | 8 bytes (timestamp of last delete) | |
| if Expiring Column | 4 bytes (TTL) + 4 bytes (local deletion time) | |

As shown above, Cassandra stores at least 15 bytes of metadata for each column: 2 bytes for the name length, 1 byte for flags, 8 bytes for the timestamp, and 4 bytes for the value length. Counter columns require an additional eight bytes of overhead, as do expiring columns (columns with a time-to-live value set). In addition to metadata, we need space for the name of each column and the value stored within it, shown above as byte arrays.

column_size = column_metadata + column_name_value + column_value

For example, if we have a column with an integer for its name (four bytes) and a long for its value (eight bytes), we end up with a column size of 27 bytes:

column_size = 15 bytes + 4 bytes + 8 bytes = 27 bytes
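The column size formula can be sketched in a few lines of Python. This is only an illustration of the arithmetic above, not Cassandra code: the function name `column_size` and its parameters are made up for this example, and the byte counts come straight from the metadata table.

```python
# Fixed per-column metadata, taken from the table above:
# 2 (name length) + 1 (flags) + 8 (timestamp) + 4 (value length)
COLUMN_METADATA_BYTES = 2 + 1 + 8 + 4  # 15 bytes
COUNTER_OVERHEAD_BYTES = 8             # timestamp of last delete
EXPIRING_OVERHEAD_BYTES = 4 + 4        # TTL + local deletion time


def column_size(name_bytes, value_bytes, counter=False, expiring=False):
    """Estimate the size of one column in bytes (hypothetical helper)."""
    size = COLUMN_METADATA_BYTES + name_bytes + value_bytes
    if counter:
        size += COUNTER_OVERHEAD_BYTES
    if expiring:
        size += EXPIRING_OVERHEAD_BYTES
    return size


# The example from the text: a 4-byte name and an 8-byte value.
print(column_size(4, 8))                 # 27
# The same column with a TTL set.
print(column_size(4, 8, expiring=True))  # 35
```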

Note that not every column has a value. If a column's value would be the same as the partition key, the column does not store a duplicate copy of it.

Clustering keys also have empty values. Clustering keys are additional columns used for ordering. Each of these columns sets its name property to the clustering key and leaves the value empty. Columns with empty values consist of 15 bytes of column metadata plus the size of the column name.

Sets of columns are organized by partition key. Every partition key requires 23 bytes of metadata. Cassandra uses partition keys to disperse data throughout a cluster of nodes and for data retrieval.

| Property | Metadata | Value |
| --- | --- | --- |
| Partition Key | 2 bytes (length as short) | byte[] |
| Flags | 1 byte | |
| Column Family ID | 4 bytes (int) | |
| Local Deletion Time | 4 bytes (int) | |
| Marked for Delete Time | 8 bytes (long) | |
| Column Count | 4 bytes (int) | |

partition_key_size = partition_key_metadata + partition_key_value
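As a quick check of the 23-byte figure, here is a minimal Python sketch that adds up the partition key metadata fields and applies the formula. The helper name `partition_key_size` and the 16-byte key (for instance, a UUID) are assumptions chosen for illustration.

```python
# Partition key metadata, field by field, from the table above.
PARTITION_KEY_METADATA_BYTES = (
    2    # key length (short)
    + 1  # flags
    + 4  # column family ID (int)
    + 4  # local deletion time (int)
    + 8  # marked for delete time (long)
    + 4  # column count (int)
)  # 23 bytes


def partition_key_size(key_value_bytes):
    """Estimate the size of a partition key in bytes (hypothetical helper)."""
    return PARTITION_KEY_METADATA_BYTES + key_value_bytes


# Assumed example: a 16-byte partition key value.
print(partition_key_size(16))  # 39
```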

Sets of columns within a table are often referred to as rows. A partition can hold multiple rows when sets share the same partition key. To calculate the size of a row, we need to sum the size of all columns within the row and add that sum to the partition key size.

row_size = sum_of_all_columns_size_within_row + partition_key_size
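The row size formula is a plain sum, sketched below in Python. The function name `row_size` and the sample numbers (three 27-byte columns and a 39-byte partition key) are assumptions picked for illustration, not measurements from a real table.

```python
def row_size(column_sizes, partition_key_size):
    """Sum the column sizes in a row and add the partition key size.

    Hypothetical helper: column_sizes is a list of per-column sizes in bytes.
    """
    return sum(column_sizes) + partition_key_size


# Assumed example: three 27-byte columns and a 39-byte partition key.
print(row_size([27, 27, 27], 39))  # 120
```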

To calculate the size of a partition, sum the row size for every row in the partition. A shortcut is to use the average row size instead of measuring each row. With this simplifying assumption, the size of a partition becomes:

partition_size = row_size_average * number_of_rows_in_this_partition

Assuming the size of the partition key is consistent throughout a table, calculating the size of a table is almost identical to calculating the size of a partition.

table_data_size = row_size_average * number_of_rows
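Both estimates are the same multiplication applied at different scopes, as the short Python sketch below shows. The function names and the sample figures (a 120-byte average row, 1,000 rows per partition, 10 million rows in the table) are assumptions for illustration only.

```python
def partition_size(row_size_average, rows_in_partition):
    """Estimate one partition's size in bytes from the average row size."""
    return row_size_average * rows_in_partition


def table_data_size(row_size_average, number_of_rows):
    """Estimate a table's data size in bytes, before replication."""
    return row_size_average * number_of_rows


# Assumed figures: 120-byte average row.
print(partition_size(120, 1_000))        # 120000
print(table_data_size(120, 10_000_000))  # 1200000000
```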

To calculate the total size of a table, we must account for the cluster's replication factor. Based on the replication factor, Cassandra writes a copy of each partition to other nodes in the cluster, so total table size is the table data size multiplied by the replication factor.

total_table_size = table_data_size * replication_factor
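A minimal Python sketch of this last step follows. The function name `total_table_size`, the input validation, and the sample values (1.2 GB of table data, replication factor of three) are assumptions for illustration.

```python
def total_table_size(table_data_size, replication_factor):
    """Estimate total bytes stored across the cluster for one table."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return table_data_size * replication_factor


# Assumed figures: 1,200,000,000 bytes of table data, replication factor 3.
print(total_table_size(1_200_000_000, 3))  # 3600000000
# With a replication factor of one, nothing is added.
print(total_table_size(1_200_000_000, 1))  # 1200000000
```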

If the replication factor is set to one (data is stored on a single node in the cluster), there is no additional overhead for replication. For clusters with a replication factor greater than one, total table size scales linearly with the replication factor.

Conclusion 

Knowing how to calculate the size of a Cassandra table allows you to estimate the effect different data models will have on the size of your Cassandra cluster. Keep in mind that in addition to the size of table data described in this post, there is additional overhead for indexing table data and organizing tables on disk.