Compression-based Distance (CBD)


Compression-based distance (CBD) is a simple, rapid, and accurate method for quantifying similarity between microbiota samples. CBD takes advantage of the repetitive nature of hypervariable tag datasets and the well-established field of compression algorithms to assess similarity between microbial communities according to the degree to which their concatenated datasets can be compressed. The following figure shows how CBD assesses similarity between two microbiota.

Figure - Schematic of how CBD measures distance between two microbiota. Two microbial communities (denoted X and Y) have been characterized using 16S variable tag sequencing. C(X), C(Y), and C(XY) denote the compressed sizes of dataset X, dataset Y, and the concatenation of datasets X and Y, respectively. The less similar X and Y are, the less compression their concatenated dataset, XY, undergoes. In the limit of completely different microbiomes, the compressed size of the concatenated dataset, C(XY), equals the sum of the individual compressed sizes, C(X) + C(Y). Conversely, when datasets X and Y are very similar, C(XY) is smaller than C(X) + C(Y), leading to a smaller CBD value.
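
The idea in the caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the published implementation: the standard-library lzma module stands in for whatever compressor the authors used, the function names (compressed_size, cbd, make_sample) are hypothetical, the toy reads are randomly generated, and the normalization 1 - 2[C(X) + C(Y) - C(XY)] / [C(X) + C(Y)] is an assumption chosen to match the limits described above (close to 1 when C(XY) = C(X) + C(Y), smaller when X and Y share content). Consult the cited paper for the exact definition.

```python
import lzma
import random


def compressed_size(data: bytes) -> int:
    """Return C(data): the size in bytes of the LZMA-compressed data.

    Assumption: lzma is a stand-in compressor chosen for this sketch.
    """
    return len(lzma.compress(data, preset=9))


def cbd(x: bytes, y: bytes) -> float:
    """Compression-based distance between two datasets.

    Assumed normalization (not confirmed by this page):
        1 - 2 * [C(X) + C(Y) - C(XY)] / [C(X) + C(Y)]
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compressed size of the concatenation XY
    shared = cx + cy - cxy        # bytes saved by compressing X and Y together
    return 1.0 - 2.0 * shared / (cx + cy)


def make_sample(pool, n_reads, rng) -> bytes:
    """Build a toy dataset: n_reads tags drawn with replacement from pool."""
    reads = rng.choices(pool, k=n_reads)
    return "\n".join(reads).encode("ascii")


# Toy data (hypothetical): two pools of random 100-base tags.
rng = random.Random(0)
pool_a = ["".join(rng.choices("ACGT", k=100)) for _ in range(20)]
pool_c = ["".join(rng.choices("ACGT", k=100)) for _ in range(20)]

x = make_sample(pool_a, 200, rng)  # community X
y = make_sample(pool_a, 200, rng)  # community Y, drawn from the same tags as X
z = make_sample(pool_c, 200, rng)  # community Z, drawn from unrelated tags

print("CBD(X, Y) =", cbd(x, y))  # similar communities
print("CBD(X, Z) =", cbd(x, z))  # dissimilar communities
```

In this toy setup, X and Y draw from the same tag pool, so their concatenation compresses well and CBD(X, Y) comes out smaller than CBD(X, Z). Real tag datasets are far larger, and the absolute values depend on the compressor and its settings.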

How to Cite CBD

Yang, F., N. Chia, B. A. White, L. B. Schook. Compression-based distance (CBD): a simple, rapid, and accurate method for microbiota composition comparison. BMC Bioinformatics 2013, 14:136

License and Terms of Use
