1. What is a colored de Bruijn graph and its applications?
A colored de Bruijn graph (cdBG) is commonly used to represent a sequence database, such as a collection of sequencing experiments or a collection of assembled genomes. It is used by tools for inferring phylogenies, quantifying RNA expression, and studying the evolution of antimicrobial resistance. As sequence databases grow to petabytes, the cost of storing or transferring the data has underscored the need for efficient disk compression algorithms. Large file sizes also hamper tool development and reproducibility efforts. Indexing cdBGs has received much attention, and various data structures have been developed to optimize metrics such as index size, construction time, or query time. For disk compression, however, the simplest option is to compress each color independently, which may fail to exploit the properties of cdBGs and may result in subpar compression ratios. Two existing tools are designed specifically for disk compression of cdBGs, but they are limited in the k-mer sizes they support. In this paper, an algorithm called ESS-color is designed, implemented, and evaluated for the disk compression of cdBGs, achieving better compression than all evaluated tools on all datasets. The absolute compression ratio is more than 26x on datasets of assembled genomes and between 1.4x and 8.7x on datasets from sequencing experiments. The software is available at http://github.com/medvedevgroup/ESSColor.
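To make the notion of a cdBG concrete, the following minimal Python sketch maps every k-mer to the list of samples (colors) that contain it. It is a toy illustration, not part of ESS-color: the function names are hypothetical, the samples are made-up strings, and it ignores reverse complements, which real tools usually merge into canonical k-mers.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_colored_kmer_sets(samples, k):
    """Map each k-mer to the sorted list of sample indices (colors) containing it.

    samples: list of samples, each a list of sequences (hypothetical input layout).
    """
    colors = defaultdict(set)
    for color, seqs in enumerate(samples):
        for seq in seqs:
            for km in kmers(seq, k):
                colors[km].add(color)
    return {km: sorted(c) for km, c in colors.items()}

# Two made-up samples sharing the k-mers CGT and GTA.
samples = [["ACGTA"], ["CGTAC"]]
cdbg = build_colored_kmer_sets(samples, 3)
```

Here `cdbg["CGT"]` lists both colors, while `cdbg["ACG"]` lists only the first sample. Storing one such color list per k-mer is exactly the redundancy that a cdBG compressor tries to exploit.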
2. How is the color matrix of E built and compressed?
The color matrix of E is built by first compressing E with ESS-basic and then ordering the matrix by the resulting ESS. First, ESS-basic compresses the nucleotide sequences of all input files, producing the union ESS. Next, the compressed file is decompressed to obtain an SPSS of E, denoted S. An SSHash dictionary is then created to map each k-mer in E to its rank in S. Finally, the binary files representing E are read using the KMC API, and the color matrix is output in the order given by the SSHash dictionary. The union ESS, S, and the color matrix are used in later stages but are not retained in the final compression output.
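The ordering step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's implementation: a plain `dict` stands in for the SSHash dictionary, Python sets stand in for the KMC databases, and the function names are hypothetical. It also assumes each k-mer occurs exactly once in S, as in an SPSS.

```python
def rank_kmers(spss, k):
    """Assign each k-mer its rank: its position in a left-to-right scan of S.

    A plain dict is used here as a stand-in for the SSHash dictionary.
    """
    ranks = {}
    for s in spss:
        for i in range(len(s) - k + 1):
            ranks.setdefault(s[i:i + k], len(ranks))
    return ranks

def color_matrix(spss, sample_kmer_sets, k):
    """Build a 0/1 matrix with one row per k-mer (in rank order) and one column per sample."""
    ranks = rank_kmers(spss, k)
    matrix = [[0] * len(sample_kmer_sets) for _ in ranks]
    for color, kmer_set in enumerate(sample_kmer_sets):
        for km in kmer_set:
            matrix[ranks[km]][color] = 1
    return matrix

# Made-up example: S is a single string covering four 3-mers.
S = ["ACGTAC"]
sample_sets = [{"ACG", "CGT", "GTA"}, {"CGT", "GTA", "TAC"}]
matrix = color_matrix(S, sample_sets, 3)
```

Because the rows follow the order of k-mers in S, neighboring rows tend to belong to neighboring k-mers, which is what later stages rely on when they encode runs of identical color vectors.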
3. What are the three types of encoding used for the local class table in the Global Class Table section?
The three types of encoding used for the local class table in the Global Class Table section are Skip, Small class difference, and End of run. Skip encoding is invoked when a k-mer is not the first or last in its simplitig and has the same class as the preceding and succeeding k-mers. Small class difference encoding is used when the Hamming distance between the color vector of a k-mer and the preceding k-mer is between 0 and maxDif. End of run encoding is used when a k-mer has the same class as the preceding k-mer and either has a different class than the succeeding k-mer or is the last k-mer in the simplitig. Each encoding type is indicated by appending specific bits to the bitvector m.
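The three rules can be sketched as a classifier over the k-mers of one simplitig. This is a hedged illustration, not the ESS-color code: the function and label names are hypothetical, the Hamming-distance condition is read here as strictly greater than 0 and at most maxDif, and the `class_id` fallback for k-mers that match none of the three rules is an assumption. The actual bits appended to m are not reproduced.

```python
def hamming(a, b):
    """Number of positions at which two equal-length color vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def classify(classes, vectors, max_dif):
    """Label each k-mer of one simplitig with the encoding rule that applies.

    classes: class ID of each k-mer, in simplitig order.
    vectors: dict mapping class ID -> color vector (tuple of 0/1).
    """
    labels = []
    n = len(classes)
    for i, c in enumerate(classes):
        same_prev = i > 0 and classes[i - 1] == c
        same_next = i < n - 1 and classes[i + 1] == c
        if same_prev and same_next:
            labels.append("skip")            # interior of a run
        elif same_prev:
            labels.append("end_of_run")      # run ends here or simplitig ends
        elif i > 0 and 0 < hamming(vectors[c], vectors[classes[i - 1]]) <= max_dif:
            labels.append("small_diff")      # few colors changed
        else:
            labels.append("class_id")        # assumed fallback: write the class ID
    return labels

# Made-up simplitig with five k-mers and three color classes.
vectors = {0: (1, 1, 0), 1: (1, 1, 1), 2: (0, 0, 0)}
labels = classify([0, 0, 0, 1, 2], vectors, max_dif=1)
```

In this example the first k-mer falls back to writing its class ID, the second is skipped, the third closes the run, the fourth differs from its predecessor in one color, and the fifth differs in three colors and so exceeds maxDif.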
4. What trade-offs are involved in setting the UseLocalID and maxDif parameters?
Setting the UseLocalID and maxDif parameters involves trade-offs that are difficult to quantify in advance. For a simplitig where every present class ID is contained within a single run, the cost of storing the local class table may exceed the benefit of using fewer bits to encode class IDs. Similarly, when maxDif is too large, writing the positions of the color differences to m can take more space than just writing the class ID. Setting maxDif = 0 can save one bit per run by using '1' instead of '11' for the 'end of run' encoding. All bitvectors are additionally compressed with RRR, which makes it difficult to determine in advance which parameters result in the least space. Therefore, all combinations of maxDif in {0, 1, 2} and UseLocalID in {True, False} are tried, and the encoding that takes the least space is used. Decompressing the m vector is straightforward, since the color matrix compression scheme is designed to be unambiguously decodable. This allows k-mer sequences and color vectors to be streamed out one k-mer at a time if the output is to be processed downstream in a streaming manner.
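The exhaustive parameter search amounts to encoding with each of the six combinations and keeping the smallest result. The sketch below shows that selection loop only. The `encode` callable is a hypothetical stand-in for the full encoder (including RRR compression), and the sizes produced by `toy_encode` are invented for illustration and are not measurements.

```python
from itertools import product

def pick_smallest(encode, max_difs=(0, 1, 2), use_local_ids=(True, False)):
    """Try every (maxDif, UseLocalID) pair and return the one with the smallest output.

    encode: callable (max_dif, use_local_id) -> bytes; a hypothetical encoder.
    Returns a tuple (max_dif, use_local_id, encoded_bytes).
    """
    best = None
    for max_dif, use_local_id in product(max_difs, use_local_ids):
        blob = encode(max_dif, use_local_id)
        if best is None or len(blob) < len(best[2]):
            best = (max_dif, use_local_id, blob)
    return best

def toy_encode(max_dif, use_local_id):
    """Toy encoder whose output size is made up; it only exercises the search loop."""
    size = 100 - 10 * max_dif
    if use_local_id:
        size += 15   # pretend the local class table costs more than it saves
    if max_dif == 2:
        size += 25   # pretend writing difference positions becomes too expensive
    return bytes(size)

best_max_dif, best_use_local_id, best_blob = pick_smallest(toy_encode)
```

Since the chosen parameters must be known at decompression time, a real encoder would also need to record them alongside the output; how ESS-color does this is not described in the text above.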