Tuesday, May 28, 2013


Data Compression Concepts and Algorithms and Their Applications to Bioinformatics

by Özkan U. Nalbantoğlu, David J. Russell and Khalid Sayood 


Compressing data involves understanding the way information is structured and, if possible, the mechanism by which the information was generated or is destined to be used. Thus, in order to compress speech it helps to know that the speech production process can be modeled by an autoregressive moving average filter excited by a signal with a periodic and a noise-like component.

Life is strongly associated with organization and structure. Living organisms can be viewed as agents communicating with their environment and storing information necessary for adaptation to the environment. This information storage, both in content and form, is shaped by the process of evolution and transmitted from one generation to the next via the DNA molecule.

In this work we look at how the tools and theoretical constructs, which have been so useful in the development of data compression algorithms, can be used to understand biologically important molecules which can be represented as sequences, such as DNA, RNA, and proteins. 

We briefly review compression algorithms developed for compressing biological sequences; however, the main focus of this post is on how conceptual tools from the data compression repertoire have been used in the field of bioinformatics.

This post looks at how the concepts of Shannon entropy, average mutual information, Kolmogorov complexity, and grammar-based modeling have been used in bioinformatics. An excellent review of compression algorithms for biological sequences already exists in the literature, so we provide only a brief survey of such algorithms and focus more on their use as a measure of distance between biological sequences. We conclude with a section on grammar-based techniques and their application in bioinformatics.

A bit of Biology

The major focus of bioinformatics is on three kinds of molecules, DNA, RNA, and proteins, all of which can be represented as sequences. The DNA molecule is made up of a concatenation of four different kinds of nucleotides, Thymine, Adenine, Cytosine, and Guanine.

The DNA molecule can be represented as a sequence by representing each nucleotide by the first letter of the corresponding nucleobase (T, A, C, and G). DNA is a double stranded molecule with neighboring strands connected through hydrogen bonding between the nucleobases. This double stranded nature of the DNA molecule makes it more robust to errors than the single stranded RNA molecule and provides a mechanism for accurate reproduction of the information.

The regions of the DNA molecule that act as blueprints for proteins are called genes. An entire gene is transcribed by the RNA polymerase enzyme into an RNA molecule. The ribose sugar in the RNA molecule differs from the deoxyribose sugar in the DNA molecule in that it carries an extra hydroxyl (OH) group. Furthermore, the nucleobases that are part of the ribose nucleotides that make up the RNA molecule are Uracil (U), Adenine (A), Cytosine (C), and Guanine (G), with Uracil replacing Thymine.

The transcription from DNA to RNA is through base pairing: every C is transcribed as a G, a G as a C, a T as an A, and an A as a U. The RNA molecule created through the action of RNA polymerase is processed and portions excised, resulting in a messenger RNA (mRNA). Each gene can be transcribed multiple times, and the transcript can be translated multiple times before it is subjected to degradation. The level of transcription and translation is controlled through a complex set of regulation mechanisms which involve, among other things, the binding of regulatory proteins to sites on the DNA close to the gene.
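
As a toy illustration of the base-pairing rule just described (our own sketch, not code from the paper), transcription of a DNA template strand into mRNA is a simple symbol-by-symbol mapping:

```python
# Transcribe a DNA template strand into mRNA via base pairing:
# C -> G, G -> C, T -> A, A -> U.
PAIRING = {"C": "G", "G": "C", "T": "A", "A": "U"}

def transcribe(template: str) -> str:
    """Return the mRNA produced from a DNA template strand."""
    return "".join(PAIRING[base] for base in template)

print(transcribe("TACGGT"))  # -> AUGCCA
```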

Entropy and Biology

Given that DNA is a means of transmitting information between generations, it was natural that the concepts of entropy and information be applied to understanding DNA. One of the first contributions to the use of the concept of information in bioinformatics was the work of Lila Gatlin in the 1960's. Gatlin proposed a definition for the information content of DNA which was essentially a measure of the divergence of the DNA sequence from an iid sequence. Given an alphabet of size N, where N is four for DNA sequences and 20 for amino acid sequences, Gatlin defined two quantities, D1 = log N − H1 and D2 = H1 − H2, which measure the divergence from the equiprobable state and the divergence from independence, respectively (here H1 is the first-order entropy of the sequence and H2 the entropy conditioned on the preceding symbol).

Gatlin connects this definition of information to redundancy by noting that defining redundancy as

R = 1 − H2 / log N

results in

R log N = D1 + D2,

so the total divergence is simply the redundancy scaled by the maximum possible entropy.

Based on the limited data available at that time, Gatlin showed empirically that DNA from vertebrates, bacteria, and phage (viruses that prey on bacteria) can be distinguished by looking at their information content, and that there is increasing redundancy in the DNA of organisms as we move from lower complexity organisms like bacteria to higher complexity organisms such as vertebrates. Plotting the data available to her, as shown in the next figure, it is easy to see why she would come to that conclusion.
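
Gatlin's two divergences are easy to estimate from a sequence. The sketch below is our own illustration (not code from the paper): H1 is the first-order entropy and H2 the entropy of a symbol conditioned on its predecessor, so D1 = log N − H1 and D2 = H1 − H2:

```python
import math
from collections import Counter

def entropies(seq: str):
    """First-order entropy H1 and conditional entropy H2, in bits/symbol."""
    n = len(seq)
    h1 = -sum(c / n * math.log2(c / n) for c in Counter(seq).values())
    m = n - 1
    pairs = Counter(zip(seq, seq[1:]))
    # H2 = H(X_i, X_{i-1}) - H(X_{i-1}), the chain rule for entropy.
    h_joint = -sum(c / m * math.log2(c / m) for c in pairs.values())
    h_prev = -sum(c / m * math.log2(c / m)
                  for c in Counter(seq[:-1]).values())
    return h1, h_joint - h_prev

def gatlin_divergences(seq: str, alphabet_size: int = 4):
    h1, h2 = entropies(seq)
    d1 = math.log2(alphabet_size) - h1  # divergence from equiprobability
    d2 = h1 - h2                        # divergence from independence
    return d1, d2

# Uniform base composition but fully deterministic transitions:
d1, d2 = gatlin_divergences("ACGT" * 100)
print(round(d1, 3), round(d2, 3))  # -> 0.0 2.0
```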

The sequence logo, as created by Schneider and Stephens, is widely used in various bioinformatic applications, as it provides information useful to biologists. A tall stack of letters at a particular location implies that the corresponding site is important for some reason. Schneider and colleagues in other work have used logos for studying DNA-protein interaction, to investigate variants, and to look for novel genes. The utility of the entropy concept in such a wide range of applications suggests that perhaps these concepts are natural to biological sequences, and a more wide-ranging and deeper analysis would be a fruitful endeavor.
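
The height of a logo stack at position i is the information content R_i = log2(4) − H_i, where H_i is the entropy of the bases observed at that position across an alignment. A minimal sketch of our own (ignoring the small-sample correction Schneider and Stephens apply):

```python
import math
from collections import Counter

def column_information(column: str) -> float:
    """Information content (bits) of one alignment column of DNA bases."""
    n = len(column)
    h = -sum(c / n * math.log2(c / n) for c in Counter(column).values())
    return math.log2(4) - h

# A made-up alignment of four sequences; columns 0 and 1 are fully
# conserved, so they score the full 2 bits and would be drawn tallest.
alignment = ["ATGC", "ATGA", "ATCC", "ATGT"]
for i in range(4):
    col = "".join(seq[i] for seq in alignment)
    print(i, round(column_information(col), 2))
```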

Application of Average Mutual Information

What entropy is to lossless compression, the rate distortion function is to lossy compression. Both provide a bound on the efficacy of compression. The rate distortion function depends on two things: the distortion measure and the average mutual information between the source coder input and the source coder output. The average mutual information between random variables X and Y is given by

I(X; Y) = Σ_x Σ_y P(x, y) log [ P(x, y) / (P(x) P(y)) ]
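
This quantity is straightforward to estimate from counts over paired samples; here is a small sketch of our own:

```python
import math
from collections import Counter

def average_mutual_information(xs, ys) -> float:
    """I(X;Y) in bits, estimated from paired samples xs[i], ys[i]."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

# A variable paired with itself shares its full entropy:
xs = "ATATATAT" * 10
print(round(average_mutual_information(xs, xs), 3))  # -> 1.0
```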

The ability of the average mutual information to expose relatedness has been used in a number of ways in bioinformatics. The average mutual information can be used to identify bases in DNA sequences, or residues in amino acid sequences, that in some way depend upon each other. This dependence can tell us something about the two- and three-dimensional structure of the virus or protein.

Korber et al., in a groundbreaking study, identified correlated mutations in the envelope protein of the HIV-1 virus. By mutating the envelope protein the virus tries to keep the host's defense off-balance. This effect is clearly seen in AMI charts developed by Sayood et al. when studying the differences between infected infants who succumbed to HIV and those who did not. The AMI chart is simply a representation of the average mutual information values in the form of a matrix, where the (i, j)th pixel in the chart represents the average mutual information between the ith residue and the jth residue of the envelope protein. 

The difference in the charts shows the differing characteristics of the populations, with substantial levels of mutation continuing in those patients who remained asymptomatic, an indication, perhaps, of continuing efforts of the viral population to overcome the host defenses.

The DNA of organisms contains regions which code for proteins and regions which do not. In eukaryotes these regions are interspersed in genes. Because the protein coding regions are interpreted in terms of triplets, one would expect a periodicity of three in the dependence between bases in coding regions, and no such periodicity in non-coding regions. This is indeed the case and the existence or non-existence of such periodicity in the average mutual information can be used to distinguish between coding and non-coding regions.
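
One way to look for this periodicity (our own sketch, not the paper's method) is to compute the average mutual information between bases separated by k positions and scan k; in real coding DNA the curve shows peaks at k = 3, 6, 9:

```python
import math
from collections import Counter

def lag_ami(seq: str, k: int) -> float:
    """Average mutual information (bits) between bases k positions apart."""
    pairs = list(zip(seq, seq[k:]))
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(c / n * math.log2((c / n) / (px[x] / n * py[y] / n))
               for (x, y), c in joint.items())

# For a candidate coding region one would scan k = 1..9 and look for
# peaks at multiples of 3.  Toy sanity check: in "ATAT..." bases two
# positions apart are always identical, so they share one full bit.
print(round(lag_ami("AT" * 50, 2), 3))  # -> 1.0
```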

The concept of average mutual information has also been used to understand relationships between expression patterns of genes. A popular method for observing a cell in action is through the use of microarrays. By measuring the amount of messenger RNA present under different conditions, these arrays indicate which genes are active in each condition and the level of activity. A gene seldom acts alone, and in order to understand the causal relationships between genes it is important to find which genes behave similarly. A number of studies use average mutual information to obtain this clustering.

Compression and Phylogeny

General purpose compression algorithms do not perform well with biological sequences, resulting
quite often in expansion rather than compression. The main focus of this post, however, is on the relationship between compression concepts and evolutionary distance metrics.

One of the best-known DNA compressors is GenCompress. GenCompress exploits the fact that DNA sequences contain tandem repeats, multiple copies of genes, and palindromic sequences. It uses a modified Lempel-Ziv (LZ) algorithm which searches for reverse complements and approximate repeats. Approximate repeats are subsequences which can be transformed into a copy of an earlier subsequence using a small number of edit operations (substitution, insertion, deletion). Copying an approximate repeat and modifying it with edit operations turns out to be cheaper, in terms of bits, than describing the sequence under construction in some other manner.

The following video shows the Lempel-Ziv algorithm at work:


Note: if the video is unclear, the references include a link that explains how the algorithm works, together with an animated GIF.
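
In the spirit of GenCompress, here is a greatly simplified sketch of our own: a greedy LZ-style parse of a DNA string that, besides the already-seen history, also searches its reverse complement for matches (exact matches only; GenCompress additionally handles approximate repeats via edit operations):

```python
# Simplified GenCompress-style parse: emit literals or phrases copied
# from the history ('fwd') or from its reverse complement ('rc').
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(s: str) -> str:
    """Reverse complement of a DNA string."""
    return s.translate(COMPLEMENT)[::-1]

def lz_parse(seq: str, min_match: int = 3):
    """Greedy parse returning literals or (source, start, length) phrases."""
    out, i = [], 0
    while i < len(seq):
        best = None
        for src, text in (("fwd", seq[:i]), ("rc", revcomp(seq[:i]))):
            length = min_match
            while i + length <= len(seq) and seq[i:i + length] in text:
                if best is None or length > best[2]:
                    best = (src, text.find(seq[i:i + length]), length)
                length += 1
        if best:
            out.append(best)
            i += best[2]
        else:
            out.append(seq[i])
            i += 1
    return out

# The second half is the reverse complement of the first half:
print(lz_parse("AAACCCGGGTTT"))
```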

Phylogeny is the evolutionary history of a group of organisms: the study of the evolutionary relationships between different groups of organisms, using data matrices built from DNA and from morphology. With this information, phylogenetic trees are established, which are the basis of phylogenetic classification. This classification is part of systematics, which also comprises phenetic and classical Linnaean classification systems.

Benedetto et al. use a compression-based measure of similarity to determine authorship and to determine the phylogeny of languages. This latter application is what interests us, as there is a great deal of literature suggesting that languages evolved through a process of natural selection similar to biological evolution. Therefore, a measure of distance which seems to reflect evolutionary distance between languages should also be useful in understanding phylogenetic relationships between organisms.

That the basic principle behind the Lempel-Ziv compression algorithms has been so successful in identifying evolutionary relationships may mean that the differences uncovered through the use of compression are somehow natural to the evolutionary process. This speculation is further supported by the use of compression-based distance metrics for protein classification and genome segmentation [61]. Kocsor et al. showed that compression-based approaches can be more accurate for protein classification than the commonly used Smith-Waterman alignment algorithm or hidden Markov models. Pelta showed that compression of protein contact maps can be used for protein classification, and that the UNIX compress algorithm can be used for protein domain identification.
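
These compression-based distances are typically instantiated as the normalized compression distance, NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed size of s. A quick sketch of our own, using zlib as a stand-in for a DNA-aware compressor:

```python
import zlib

def C(data: bytes) -> int:
    """Compressed size in bytes, using zlib at maximum effort."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 for related sequences."""
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

seq_a = b"ATGGCGTATTAGC" * 40  # made-up repetitive "genomes"
seq_b = b"CCGTTAACGGTCA" * 40
print(round(ncd(seq_a, seq_a), 3), round(ncd(seq_a, seq_b), 3))
```

A sequence concatenated with itself compresses almost as well as one copy, so the self-distance is small, while unrelated sequences share little and score near 1.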

Grammar and Biology

The Lempel-Ziv algorithms, though usually not thought of that way, are examples of the use of
a grammar for compression. The original concept behind abstract grammars is that a grammar G is
meant to completely describe the underlying structure of a corpus of sequences. Because most naturally
occurring sequences contain repetition and redundancy, grammars are often able to describe sequences
efficiently. Hence, the usage of grammars can be thought of as a means to provide compression.
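
As a minimal illustration of grammar-based compression (a sketch of our own, in the style of the Re-Pair algorithm rather than any method from the paper): repeatedly replace the most frequent pair of adjacent symbols with a new nonterminal, recording one rule per replacement. Repetitive input yields a start string plus rule set smaller than the original:

```python
from collections import Counter

def build_grammar(seq: str):
    """Re-Pair-style grammar construction: replace the most frequent
    digram with a fresh nonterminal until no digram repeats."""
    symbols = list(seq)
    rules = {}
    next_id = 0
    while True:
        digrams = Counter(zip(symbols, symbols[1:]))
        if not digrams:
            break
        pair, count = digrams.most_common(1)[0]
        if count < 2:
            break  # nothing repeats; the grammar cannot shrink further
        nt = f"R{next_id}"
        next_id += 1
        rules[nt] = pair
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(nt)  # greedy left-to-right replacement
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols, rules

start, rules = build_grammar("ATGATGATGC")
print(start, rules)
```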

The next figure shows how this type of compression works.

Identification of function and/or meaning of segments of biological sequences remains an ongoing and active area of research. This means studying primary structure, or the sequential ordering, of sequences and the secondary structure, or the three-dimensional shapes that form due to attractions that occur among separated segments within the sequences. A somewhat uncommon method for predicting RNA secondary structure focuses only on the information contained within the sequences.

The concepts behind data compression have been very useful in understanding how information is organized in a number of signals used in multimedia communications. It is a natural step to go from analyzing signals such as speech, audio and video to analyzing biological “signals” such as DNA, RNA, and proteins. The results have been somewhat counterintuitive. Instead of these techniques being useful in the development of compression algorithms for biological signals, these concepts have been most useful in illuminating various biological relationships. These range from providing a species signature to providing tools for analyzing the behavior of gene regulation networks. Reviewing the wide variety of places where these concepts have been useful, it is difficult to escape the feeling that information theory, in particular those aspects of it that relate to data compression, is somehow organic to the area of bioinformatics. We believe this will be a fascinating field of study for many years to come and a productive area in which people with an understanding of these concepts can make valuable contributions.

[*] Özkan U. Nalbantoğlu, David J. Russell and Khalid Sayood. "Data Compression Concepts and Algorithms and Their Applications to Bioinformatics". Entropy 2010, 12.

[*] Pablo Morales. "Qué es la Filogenia" ("What is Phylogeny"). July 9, 2010.

[*] "Animation of Lempel-Ziv Encoding Algorithm" gif lempel ziv.

Images taken from the PDF.
