+++
author = ["Alexander Neumann"]
title = "Splitting Data with Content-Defined Chunking"
date = 2018-12-01T00:00:00Z
series = ["Advent 2018"]
+++

In this post you’ll learn what Content-Defined Chunking (CDC) is and how you can
use it to split large data into smaller blocks in a deterministic way. These
blocks can be found again in other data later, even if the location of the
block is different than the first time. I wrote a small Go package to do the
splitting, which performs really well.

Why is splitting data an interesting problem?

In my spare time, I’m working on a fast backup program called
restic, which is written in Go. When a (possibly large)
file is read, the program needs to save the data somewhere so that it can be
restored later. I think it is a good idea to split files into smaller parts,
which are more manageable for a program. It also allows detecting and efficiently
handling small changes in big files, like virtual machine images. Once such a
part of a file is saved, it can be referenced when the same data is contained
in different files, so restic de-duplicates the data it reads. The parts are
identified based on the SHA-256 hash of the contents, so a list of these hashes
can describe the content of a file.

There are different strategies for splitting files. The most obvious one is to
use static boundaries, e.g. to split after every megabyte of data.
This gives us manageable chunks, but if the data in the file is shifted, for
example by inserting a byte at the start of the file, all chunks will be
different and need to be saved anew. We can do better than that.
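
To make the problem concrete, here is a minimal sketch of such fixed-size
splitting (the helper splitFixed is invented for this illustration and is not
part of the chunker package):

    // splitFixed splits data at static boundaries: 0, chunkSize, 2*chunkSize,
    // and so on. Inserting a single byte at the front of data shifts every
    // boundary, so every chunk after the insertion point changes.
    func splitFixed(data []byte, chunkSize int) [][]byte {
        var chunks [][]byte
        for len(data) > 0 {
            n := chunkSize
            if n > len(data) {
                n = len(data)
            }
            chunks = append(chunks, data[:n])
            data = data[n:]
        }
        return chunks
    }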

Rabin fingerprints

During the research I did before starting to implement what would later become
restic, I discovered a publication titled “Fingerprinting by Random
Polynomials” by Michael O. Rabin, published in 1981. It describes a way to
efficiently compute a “Rabin fingerprint” (or “hash”, in modern terms) for a
number of bytes. It has the great property that it can be computed as a
rolling hash: the Rabin fingerprint of a window of bytes can be efficiently
calculated while the window moves through a very large file. The idea is that
when a new byte is to be processed, the oldest byte is first removed from the
calculation, and the new byte is factored in.
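
To get a feeling for how a rolling hash works, here is a simplified sketch
that uses a plain sum instead of the polynomial arithmetic of a real Rabin
fingerprint (the type roll and its methods are invented for this
illustration):

    // roll keeps a toy rolling hash over a fixed-size window of bytes: when a
    // new byte enters the window, the contribution of the oldest byte is
    // removed and the contribution of the new byte is added. A real Rabin
    // fingerprint follows the same pattern, but with polynomial arithmetic
    // over GF(2) instead of a simple sum.
    type roll struct {
        window []byte // the last len(window) bytes seen
        pos    int    // index of the oldest byte in the window
        sum    uint64 // current hash value
    }

    func newRoll(windowSize int) *roll {
        return &roll{window: make([]byte, windowSize)}
    }

    // update feeds one byte into the window and returns the new hash value.
    func (r *roll) update(b byte) uint64 {
        old := r.window[r.pos] // byte that leaves the window
        r.window[r.pos] = b    // the new byte takes its place
        r.pos = (r.pos + 1) % len(r.window)
        r.sum = r.sum - uint64(old) + uint64(b)
        return r.sum
    }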

How can we use that to find chunk boundaries? Restic uses a window of 64 bytes,
so while reading a file, it computes the fingerprint of all such windows: byte
0 to byte 63, byte 1 to byte 64, and so on. It then checks, for every
fingerprint, whether the least significant bits are all zero. If this is the case, then we found a
chunk boundary! For random input bytes, the calculated fingerprints are roughly
random as well, so by varying the number of bits that are checked we can
(roughly) configure how large the resulting chunks will be. Since the cut
points for individual chunks only depend on the 64 bytes preceding them, this
method of splitting data is called “Content-Defined Chunking” (CDC).
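
In code, the boundary check boils down to masking the fingerprint with its
lowest bits. A minimal sketch (the constant names and the value of 20 bits are
illustrative, not the package defaults):

    // With the lowest bitsPerChunk bits required to be zero, a boundary is hit
    // on average once every 2^bitsPerChunk bytes of random input, so this
    // constant controls the average chunk size (here roughly 1 MiB).
    const bitsPerChunk = 20
    const boundaryMask = (1 << bitsPerChunk) - 1

    func isBoundary(fingerprint uint64) bool {
        return fingerprint&boundaryMask == 0
    }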

Early on I recognized that the algorithm and implementation have the potential
to be very helpful to other projects as well, so I published it as a separate
package for other people to import (API on godoc.org) with a very liberal
BSD 2-clause license.

Enough theory, let’s dive into the code!

Splitting data

All the hard work of computing the fingerprints over the sliding window is
done by the Chunker type. The Rabin fingerprint needs a so-called “irreducible
polynomial” to work, which is encoded as a uint64.
The chunker package exports the function RandomPolynomial() which creates a
new polynomial for you to use.

Our first program will generate a random polynomial and create a new Chunker
which reads data from os.Stdin:

    p, err := chunker.RandomPolynomial()
    if err != nil {
        panic(err)
    }
    fmt.Printf("using polynomial %v for splitting data\n", p)

    chk := chunker.New(os.Stdin, p)

The Next() method on the Chunker reads data from the reader (os.Stdin in
this example) into a byte slice buffer. The method returns the next Chunk and
an error. We call it repeatedly until io.EOF is returned, which tells us that
all data has been read:

    buf := make([]byte, 16*1024*1024) // 16 MiB

    for {
        chunk, err := chk.Next(buf)
        if err == io.EOF {
            break
        }
        if err != nil {
            panic(err)
        }

        fmt.Printf("%d\t%d\t%016x\n", chunk.Start, chunk.Length, chunk.Cut)
    }

Let’s get some random data to test:

    $ dd if=/dev/urandom bs=1M count=100 of=data
    100+0 records in
    100+0 records out
    104857600 bytes (105 MB, 100 MiB) copied, 0,548253 s, 191 MB/s

Now if we feed the data in this file to the program we’ve just written
(code), it prints the start offset, length, and fingerprint for each chunk it
finds:

    $ cat data | go run main.go
    using polynomial 0x3dea92648f6e83 for splitting data
    0          2908784  000c31839a100000
    2908784    2088949  001e8b4104900000
    4997733    1404824  000a18d1f0200000
    6402557    617070   001bc6ac84300000
    7019627    1278326  001734984b000000
    8297953    589913   0004cf802b600000
    8887866    554526   001eb5a362900000
    9442392    1307416  000ef1e549c00000
    [...]
    102769045  864607   00181b67df300000
    103633652  592134   0000a0c8b4200000
    104225786  631814   001d5ba20d38998b

On a second run, it’ll generate a new polynomial and the chunks will be
different, so if you depend on your program computing the same boundaries,
you’ll need to pass in the same polynomial.

Let’s fix the polynomial (0x3dea92648f6e83) for the next runs and change the
program like this (code):

    chk := chunker.New(os.Stdin, 0x3dea92648f6e83)

After replacing the randomly generated polynomial with a constant, every new
run will give us the same chunks:

    $ cat data | go run main.go
    0          2908784  000c31839a100000
    2908784    2088949  001e8b4104900000
    4997733    1404824  000a18d1f0200000
    [...]
    102769045  864607   00181b67df300000
    103633652  592134   0000a0c8b4200000
    104225786  631814   001d5ba20d38998b

Now we can experiment with the data. For example, what happens when we insert
bytes at the beginning? Let’s test that by inserting the bytes foo using the
shell:

    $ (echo -n foo; cat data) | go run main.go
    0          2908787  000c31839a100000
    2908787    2088949  001e8b4104900000
    4997736    1404824  000a18d1f0200000
    6402560    617070   001bc6ac84300000
    [...]

We can see that the first chunk is different: it’s three bytes longer (2908787
versus 2908784), but the fingerprint at the end of the chunk is the same. All
other chunks following the first one are also the same!

Let’s add a hash to identify the contents of the individual chunks (code):

    for {
        chunk, err := chk.Next(buf)
        if err == io.EOF {
            break
        }
        if err != nil {
            panic(err)
        }

        hash := sha256.Sum256(chunk.Data)
        fmt.Printf("%d\t%d\t%016x\t%032x\n", chunk.Start, chunk.Length, chunk.Cut, hash)
    }

These are the chunks for the raw, unmodified file:

    $ cat data | go run main.go
    0        2908784  000c31839a100000  8d17f5f7326a2fd7[...]
    2908784  2088949  001e8b4104900000  ba7c71226609c0de[...]
    4997733  1404824  000a18d1f0200000  9c4db21626a27d4b[...]
    6402557  617070   001bc6ac84300000  4c249c3b88500c23[...]
    7019627  1278326  001734984b000000  5bec562b7ad37b8b[...]
    [...]

Now we can see that adding bytes at the start only changes the first chunk:

    $ (echo -n foo; cat data) | go run main.go
    0        2908787  000c31839a100000  f41084d25bc273f2[...]
    2908787  2088949  001e8b4104900000  ba7c71226609c0de[...]
    4997736  1404824  000a18d1f0200000  9c4db21626a27d4b[...]
    6402560  617070   001bc6ac84300000  4c249c3b88500c23[...]
    7019630  1278326  001734984b000000  5bec562b7ad37b8b[...]

The SHA-256 hash for the first chunk changed from 8d17f5f7326a2fd7[...] to
f41084d25bc273f2[...], but the rest are still the same. That’s pretty cool.

Let’s see how many different chunks we have in our data file by counting the
different hashes:

    $ cat data | go run main.go | cut -f4 | sort | uniq | wc -l
    69

What happens when we feed the program our data twice? How many chunks does it detect? Let’s find out:

    $ (cat data; cat data) | go run main.go | cut -f4 | sort | uniq | wc -l
    70

Huh, so we only got a single additional chunk. The cause is that for the first
round, it’ll find the same chunks as before (it’s the same data after all)
right until the last chunk. In the previous run, the end of the last chunk was
determined by the end of the file. Now there’s additional data, so the chunk
continues. The next cut point lies in the second copy of the data and is the
same as the very first cut point, so the resulting chunk is the first chunk
again, just with some additional data (the tail of the first copy) prepended
to it, which is why its SHA-256 hash is different. Afterwards, the same chunks
follow in the same order.

Changing a few consecutive bytes of data somewhere in the file also affects
only a single chunk in most cases. Let’s write the output to a file, change
the data a bit using sed, and write another file:

    $ cat data | go run main.go > orig.log
    $ cat data | sed 's/EZE8HX/xxxxxx/' | go run main.go > mod.log

The string EZE8HX was present in my randomly generated file data only once
and the sed command above changed it to the string xxxxxx. When we now
compare the two log files using diff, we can see that for exactly one chunk
the SHA-256 hash has changed, but the other chunks and the chunk boundaries
stayed the same:

    $ diff -au orig.log mod.log
    --- orig.log  2018-11-24 13:14:45.265329525 +0100
    +++ mod.log   2018-11-24 13:19:32.344681407 +0100
    @@ -55,7 +55,7 @@
     83545247  547876   00198b8839f00000  2cbeea803ba79d54[...]
     84093123  889631   0007ddacabc00000  a2b6e8651ae1ab69[...]
     84982754  2561935  000c53712f500000  bca589959b019a80[...]
    -87544689  2744485  0000a49118b00000  02b125104f06ed85[...]
    +87544689  2744485  0000a49118b00000  ce2e43496ae6e7c3[...]
     90289174  1167308  00034d6cce700000  ad25a331993d9db7[...]
     91456482  1719951  001d2fea8ae00000  ba5153845c5b5228[...]
     93176433  1362655  0003009fc1600000  8e7a35e340c10d61[...]

This technique can be used for many other things besides a backup program. For
example, the program rsync uses content-defined chunking to efficiently
transfer files by detecting which parts of the files are already present on
the receiving side (with a different rolling hash). The Low Bandwidth Network
File System (LBFS) uses CDC with Rabin fingerprints to only transfer chunks
that are needed over the network. Another application of a rolling hash is
finding strings in text, for example in the Rabin-Karp algorithm.
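
To show the idea behind Rabin-Karp, here is a small self-contained sketch that
uses a simple modular polynomial hash as its rolling hash (it is not a Rabin
fingerprint and not related to the chunker package):

    // rabinKarp returns the index of the first occurrence of pattern in text,
    // or -1 if there is none. It rolls a hash over text and compares bytes
    // directly only when the window hash matches the pattern hash.
    func rabinKarp(text, pattern string) int {
        const base, mod = 257, 1000000007
        n, m := len(text), len(pattern)
        if m == 0 || m > n {
            return -1
        }

        // pow is base^(m-1) mod mod, the weight of the byte leaving the window.
        var pow, pHash, tHash uint64 = 1, 0, 0
        for i := 0; i < m; i++ {
            pHash = (pHash*base + uint64(pattern[i])) % mod
            tHash = (tHash*base + uint64(text[i])) % mod
            if i > 0 {
                pow = pow * base % mod
            }
        }

        for i := 0; ; i++ {
            if tHash == pHash && text[i:i+m] == pattern {
                return i
            }
            if i+m >= n {
                return -1
            }
            // Roll the window: drop text[i], add text[i+m].
            tHash = (tHash + mod - uint64(text[i])*pow%mod) % mod
            tHash = (tHash*base + uint64(text[i+m])) % mod
        }
    }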

If you’re interested in how the chunker package is used in restic, there’s a
post over at the restic blog which explains this in a bit more detail.

Performance

Performance is crucial for backup programs. If creating backups is too slow,
people tend to stop using the program.

The chunker package is written in plain Go. It doesn’t have any fancy
low-level assembler, but the code is already pretty fast, thanks to some
optimized calculations. The package has some benchmarks:

    $ go test -run xxx -bench 'BenchmarkChunker$'
    goos: linux
    goarch: amd64
    pkg: github.com/restic/chunker
    BenchmarkChunker-4            20          70342669 ns/op         477.01 MB/s
    --- BENCH: BenchmarkChunker-4
        chunker_test.go:325: 23 chunks, average chunk size: 1458888 bytes
        chunker_test.go:325: 23 chunks, average chunk size: 1458888 bytes
    PASS
    ok      github.com/restic/chunker       1.655s

Running on an older machine with an Intel i5 CPU, the chunker processes data
at about 477 MB per second on a single core.
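
If you want to measure the throughput on your own machine and data, a
benchmark along these lines might work (a sketch; the file name, package name,
benchmark name BenchmarkChunkRandomData, and the fixed polynomial and buffer
size reused from earlier in this post are just illustrative choices):

    // chunker_bench_test.go
    package chunking

    import (
        "bytes"
        "io"
        "math/rand"
        "testing"

        "github.com/restic/chunker"
    )

    func BenchmarkChunkRandomData(b *testing.B) {
        // 32 MiB of pseudo-random data to split on every iteration.
        data := make([]byte, 32*1024*1024)
        rand.Read(data)

        buf := make([]byte, 16*1024*1024)
        b.SetBytes(int64(len(data)))
        b.ResetTimer()

        for i := 0; i < b.N; i++ {
            chk := chunker.New(bytes.NewReader(data), 0x3dea92648f6e83)
            for {
                _, err := chk.Next(buf)
                if err == io.EOF {
                    break
                }
                if err != nil {
                    b.Fatal(err)
                }
            }
        }
    }

Running it with go test -bench ChunkRandomData reports the throughput in MB/s.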

Conclusion

Content-Defined Chunking can be used to split data into smaller chunks in a
deterministic way so that the chunks can be rediscovered if the data has
slightly changed, even when new data has been inserted or data was removed. The
chunker package provides a fast implementation in pure Go with a liberal
license and a simple API. If you use the chunker package for something cool,
please let us know by opening an issue in the repo over at GitHub!
