Philip Bille, DTU Compute

Shortcut through data jungle

Tuesday 19 May 2015
by Iben Julie Schmidt



New methods of compressing data may be the solution to many of the challenges posed by growing volumes of ‘big data’. New smart compression algorithms make it possible to work directly on the compressed data. In addition to saving time and storage space, the technology will make it possible to process even larger data volumes.

BIG DATA is on everyone’s lips, as there is huge potential in the data explosion we are currently experiencing. Regardless of whether your chosen field is disease research, marketing, climate problems, or understanding the universe, it seems that the answers we seek are hidden in the petabytes of data that are generated in an endless stream.

But as promising as the growing data volumes may be, they pose major challenges in the form of insufficient storage space, computing power, memory, bandwidth and, not least, time, which is why data compression is an increasingly important factor. At DTU Compute, a researcher has become something of a data compression pioneer since showing that it is possible to work directly on compressed data.

Jumping around in compressed data
“I showed that you can take compressed data (in this case a classic Lempel-Ziv compression) and then build something on top of it, enabling you to jump around in the data and carry out different tasks without first having to decompress it, and that was a real eye-opener. What we really did was invent a new way of representing data. By combining a range of different, classic techniques we created a new internal data structure,” explains Philip Bille, Associate Professor at DTU Compute.
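To make the idea of "jumping around" concrete, here is a minimal Python sketch. It is not the data structure described in the quote: it is a naive illustration that assumes a simple LZ77-style parse into (source position, copy length, literal character) phrases, and the function names `lz77_compress`, `build_index` and `char_at` are hypothetical. It reads a single character out of the compressed form by following copy references backwards, without ever rebuilding the full text.

```python
from bisect import bisect_right


def lz77_compress(text):
    """Greedy LZ77-style parse into (source, length, literal) phrases.

    Each phrase copies `length` characters starting at an earlier position
    `source` and then appends one literal character. Naive and slow; meant
    only for short demonstration strings.
    """
    phrases = []
    i = 0
    while i < len(text):
        best_src, best_len = 0, 0
        for src in range(i):
            length = 0
            # Keep one character in reserve to serve as the phrase's literal.
            while (i + length < len(text) - 1
                   and text[src + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_src, best_len = src, length
        phrases.append((best_src, best_len, text[i + best_len]))
        i += best_len + 1
    return phrases


def build_index(phrases):
    """Return the start offset of every phrase and the total text length."""
    starts, pos = [], 0
    for _, length, _ in phrases:
        starts.append(pos)
        pos += length + 1
    return starts, pos


def char_at(phrases, starts, i):
    """Return character i of the original text without decompressing it."""
    while True:
        k = bisect_right(starts, i) - 1   # phrase that covers position i
        src, length, literal = phrases[k]
        offset = i - starts[k]
        if offset == length:              # the phrase's trailing literal
            return literal
        i = src + offset                  # jump to the strictly earlier copy source


text = "abababababcabababab"
phrases = lz77_compress(text)
starts, total = build_index(phrases)
sample = char_at(phrases, starts, 10)
```

Each jump lands on a strictly earlier position, so the lookup always terminates, but in this naive form a single lookup may follow many references. Making such queries provably fast is the kind of problem the research addresses.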

The new method attracted attention, because the ability to work on compressed data, without needing extra storage space or spending time on decompression, can open up new opportunities. One of the areas where the method has already proved viable is video surveillance.

Search in surveillance video
“Surveillance cameras typically record idle footage, i.e. hours and hours of video where very little or nothing is happening, so you can save it in a compact way where it takes up far less space. At the same time, it’s an advantage if you can quickly search for specific things. That way you avoid having to sift through endless hours of video surveillance. In a project funded by the Danish National Advanced Technology Foundation, we collaborated with, among others, the video software company Milestone to develop a search function that can perform various smart searches. For example, you can select a specific area of the recording and then search for when something has taken place in precisely that area. And our job has been to render the data as small as possible,” explains Philip Bille.
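As a rough illustration of why idle footage compresses well and can still be searched, the Python sketch below collapses consecutive identical frames into runs and answers an area query by reading only those runs. Everything here is an assumption made for the example: the frame model (a set of named areas with activity per frame), the region names, and the functions `encode_runs` and `search_region` are hypothetical and do not describe the search function developed in the project.

```python
def encode_runs(frames):
    """Collapse consecutive identical frames into (start, length, regions) runs.

    `frames` is a list of frozensets naming the areas with activity in each
    frame; an empty frozenset stands for an idle frame.
    """
    runs = []
    for index, active in enumerate(frames):
        if runs and runs[-1][2] == active:
            start, length, regions = runs[-1]
            runs[-1] = (start, length + 1, regions)   # extend the current run
        else:
            runs.append((index, 1, active))           # start a new run
    return runs


def search_region(runs, region):
    """Return (first_frame, last_frame) intervals where `region` is active.

    Only the runs are scanned; individual frames are never expanded.
    """
    return [(start, start + length - 1)
            for start, length, regions in runs
            if region in regions]


idle = frozenset()
frames = ([idle] * 5 + [frozenset({"door"})] * 3
          + [idle] * 4 + [frozenset({"door", "window"})] * 2)
runs = encode_runs(frames)
door_hits = search_region(runs, "door")
```

In this toy example, fourteen frames shrink to four runs, and the query touches four entries instead of fourteen frames. The longer the idle stretches, the larger the saving.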

DNA codes can be compressed efficiently
A second area where the method can play an important role is gene sequencing. Since the development of Next Generation Sequencing, the price of DNA sequencing has fallen drastically, which has made gene sequencing information one of the areas where data volumes are growing the fastest. In projects such as Genome Denmark, for example, scientists have sequenced the complete genomes of 30 Danes in order to conduct research into the link between genes and diseases, and this requires a lot of data capacity. However, this is only the beginning of a development in which more and more people, as well as other organisms, will have their complete genomes sequenced.

“It is precisely in the field of gene sequencing that it makes really good sense to compress data, for even though a full genome comprises over 3 billion base pairs, DNA sequences from the same species are incredibly similar. Over 99 per cent of the genetic code is completely identical, so once you have saved one genome, it’s possible to compress the others relative to the first, so that they can be compressed very efficiently and stored in compressed form. The exciting prospect is to develop compression methods that at the same time enable us to work on the data, for example to search for specific patterns or gene variations, without first having to decompress the large volumes of data. And this is one of the goals of our future research,” explains Philip Bille.
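The idea of compressing one genome relative to another can be sketched in a few lines of Python. This is a naive greedy parse in the style of relative Lempel-Ziv, chosen here only for illustration and not taken from the research described; the toy sequences and the function names `relative_compress` and `relative_decompress` are hypothetical. The target is stored as a list of copies from the reference, plus single literal characters where no match exists.

```python
def relative_compress(reference, target):
    """Encode `target` as copies from `reference` plus literal characters.

    Produces ("copy", position, length) and ("lit", character) phrases by
    greedily taking the longest prefix of the remaining target that occurs
    somewhere in the reference. Naive and slow; for short examples only.
    """
    phrases = []
    i = 0
    while i < len(target):
        pos, length = -1, 0
        # Grow the match one character at a time while it still occurs.
        while i + length < len(target):
            found = reference.find(target[i:i + length + 1])
            if found == -1:
                break
            pos, length = found, length + 1
        if length == 0:
            phrases.append(("lit", target[i]))   # character absent from reference
            i += 1
        else:
            phrases.append(("copy", pos, length))
            i += length
    return phrases


def relative_decompress(reference, phrases):
    """Rebuild the target from the reference and the phrase list."""
    parts = []
    for phrase in phrases:
        if phrase[0] == "copy":
            _, pos, length = phrase
            parts.append(reference[pos:pos + length])
        else:
            parts.append(phrase[1])
    return "".join(parts)


reference = "ACGTACGTTAGCCGATAGGCTA"
# Same sequence with a single substituted base at index 10 (G -> T).
target = reference[:10] + "T" + reference[11:]
phrases = relative_compress(reference, target)
restored = relative_decompress(reference, phrases)
```

Because the two sequences differ in only one position, the target is described by a handful of phrases rather than 22 characters. With real genomes that are more than 99 per cent identical, the same principle applies on a much larger scale, although practical tools use indexed references rather than repeated string searches.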

Algorithms of the future
Associate Professor Philip Bille was recently awarded the prestigious Sapere Aude grant of DKK 7 million from the Danish Council for Independent Research. This means that in the coming years he and Associate Professor Inge Li Gørtz, two PhD students, two postdocs, and partners from Israel, Norway and Finland will be able to continue their research into, among other things, searching and indexing in compressed data.

Philip Bille’s team of researchers will both work with existing compression algorithms and try to build future algorithms that are designed from the outset to be worked on in compressed form.