Working with Compressed Tar Files in Go

https://medium.com/learning-the-go-programming-language/working-with-compressed-tar-files-in-go-e6fe9ce4f51d
Vladimir Vivien · Jul 20 · 8 min read


This post shows how to use the archive and compress packages to write code that programmatically builds compressed tar archives and extracts files from them. Both packages use Go’s streaming IO idiom, which makes it easy to read or write data from diverse sources that can be compressed and archived.

Source code for this post https://github.com/vladimirvivien/go-tar

Tar

A tar file is a collection of binary data segments (usually sourced from files). Each segment starts with a header that contains metadata about the binary data that follows it and about how to reconstruct that data as a file.

    +---------------------------+
    | [name][mode][uid][gid]    |
    | ...                       |
    +---------------------------+
    | XXXXXXXXXXXXXXXXXXXXXXXXX |
    | XXXXXXXXXXXXXXXXXXXXXXXXX |
    | XXXXXXXXXXXXXXXXXXXXXXXXX |
    +---------------------------+
    | [name][mode][uid][gid]    |
    | ...                       |
    +---------------------------+
    | XXXXXXXXXXXXXXXXXXXXXXXXX |
    | XXXXXXXXXXXXXXXXXXXXXXXXX |
    +---------------------------+
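
The layout above can be observed directly. The following is a minimal sketch, not code from the post's repository: it writes one synthetic entry into an in-memory buffer and reports the archive size. It relies on the tar format's fixed 512-byte block size; the helper name archiveSize and the entry name a.txt are hypothetical.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"log"
)

// blockSize is the fixed block size used by the tar format.
const blockSize = 512

// archiveSize (a hypothetical helper) writes one entry holding content
// into an in-memory archive and returns the total archive size in bytes.
func archiveSize(content string) (int, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	hdr := &tar.Header{Name: "a.txt", Mode: 0600, Size: int64(len(content))}
	if err := tw.WriteHeader(hdr); err != nil {
		return 0, err
	}
	if _, err := tw.Write([]byte(content)); err != nil {
		return 0, err
	}
	// Close pads the last entry and writes the end-of-archive marker.
	if err := tw.Close(); err != nil {
		return 0, err
	}
	return buf.Len(), nil
}

func main() {
	n, err := archiveSize("hello")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("archive is %d bytes (%d blocks of %d)\n", n, n/blockSize, blockSize)
}
```

Even a five-byte payload produces an archive whose size is a whole number of blocks, because the header, the padded data, and the end-of-archive marker are all block-aligned.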

The tar Package

Let us start with a simple example that takes in-memory data (synthetic files) and tars that data into the archive file out.tar. This illustrates how the different pieces of the tar package work.

The next section shows how to create tar files from actual file sources.

The next code snippet creates a function value assigned to tarWrite which loops through the provided map (files) to create the tar segments for the archive:

    import (
        "archive/tar"
        "log"
        "os"
        // ...
    )

    func main() {
        tarPath := "out.tar"
        files := map[string]string{
            "index.html": `<body>Hello!</body>`,
            "lang.json":  `[{"code":"eng","name":"English"}]`,
            "songs.txt":  `Claire de la lune, The Valkyrie, Swan Lake`,
        }
        // tarWrite writes each map entry to tarPath as one tar segment.
        tarWrite := func(data map[string]string) error {
            tarFile, err := os.Create(tarPath)
            if err != nil {
                return err // let the caller decide how to handle the failure
            }
            defer tarFile.Close()
            tw := tar.NewWriter(tarFile)
            defer tw.Close() // writes the end-of-archive marker
            for name, content := range data {
                hdr := &tar.Header{
                    Name: name,
                    Mode: 0600,
                    Size: int64(len(content)),
                }
                if err := tw.WriteHeader(hdr); err != nil {
                    return err
                }
                if _, err := tw.Write([]byte(content)); err != nil {
                    return err
                }
            }
            return nil
        }
        // ...
        if err := tarWrite(files); err != nil {
            log.Fatal(err)
        }
    }

Source file https://github.com/vladimirvivien/go-tar/simple/tar1.go

In the previous snippet, variable tw is created as a *tar.Writer which uses tarFile as its target. For each (synthetic) file from map data, a tar.Header is created which specifies a file name, a file mode, and a file size. The header is then written with tw.WriteHeader followed by the content of the file using tw.Write.
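
One detail worth keeping in mind: Go does not specify the iteration order of a map, so the order of the entries in out.tar can differ from run to run. The sketch below is an illustrative variant, not the post's code. It sorts the names first so the archive layout is deterministic, and it returns the error from tw.Close instead of deferring it. The helper names writeSorted and archiveNames are hypothetical.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
	"log"
	"sort"
)

// writeSorted writes files to w as tar entries in sorted name order.
func writeSorted(w io.Writer, files map[string]string) error {
	names := make([]string, 0, len(files))
	for name := range files {
		names = append(names, name)
	}
	sort.Strings(names)

	tw := tar.NewWriter(w)
	for _, name := range names {
		content := files[name]
		hdr := &tar.Header{Name: name, Mode: 0600, Size: int64(len(content))}
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		if _, err := tw.Write([]byte(content)); err != nil {
			return err
		}
	}
	// Return the Close error: it reports failures while finishing the archive.
	return tw.Close()
}

// archiveNames builds an in-memory archive with writeSorted and returns
// the entry names in the order they appear in the archive.
func archiveNames(files map[string]string) ([]string, error) {
	var buf bytes.Buffer
	if err := writeSorted(&buf, files); err != nil {
		return nil, err
	}
	var names []string
	tr := tar.NewReader(&buf)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return names, nil
		}
		if err != nil {
			return nil, err
		}
		names = append(names, hdr.Name)
	}
}

func main() {
	names, err := archiveNames(map[string]string{
		"songs.txt":  "Swan Lake",
		"index.html": "<body>Hello!</body>",
		"lang.json":  "[]",
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(names) // prints "[index.html lang.json songs.txt]"
}
```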

There are many more tar header fields. The three illustrated are the minimum required to create a functional archive.

When the code is executed, it creates the file out.tar. We can verify that the archive was created properly by listing its contents with the command tar -tvf out.tar.

The listing shows that the archive contains all three files, as expected. However, because the headers were incomplete, some file information is either wrong or missing (such as the date and file ownership).
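
To fill in the missing information by hand, more fields of tar.Header can be set. The following is a sketch under stated assumptions: the numeric owner 1000 and the user/group name "gopher" are made-up values for illustration, and the helper name fullHeader is hypothetical.

```go
package main

import (
	"archive/tar"
	"fmt"
	"time"
)

// fullHeader builds a header for a regular file with an explicit
// timestamp and ownership. The uid/gid values and the "gopher" names
// are illustrative assumptions, not values taken from the post.
func fullHeader(name string, size, modUnix int64) *tar.Header {
	return &tar.Header{
		Typeflag: tar.TypeReg,
		Name:     name,
		Mode:     0644,
		Size:     size,
		ModTime:  time.Unix(modUnix, 0),
		Uid:      1000,
		Gid:      1000,
		Uname:    "gopher",
		Gname:    "gopher",
	}
}

func main() {
	hdr := fullHeader("index.html", 19, time.Now().Unix())
	fi := hdr.FileInfo()
	fmt.Println(fi.Name(), fi.Mode(), fi.ModTime().Format(time.RFC3339), hdr.Uname, hdr.Gname)
}
```

A header built this way can be passed to tw.WriteHeader exactly like the three-field header used earlier.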

To test the generated tar, use command tar -xvf out.tar to extract the files.

Programmatically, the files contained in the archive can be extracted using the tar package as well. The following source snippet opens the tar file and prints the content of each entry to stdout:

    func main() {
        tarPath := "out.tar"
        // tarUnwrite prints the name and content of every entry in tarPath.
        tarUnwrite := func() error {
            tarFile, err := os.Open(tarPath)
            if err != nil {
                return err
            }
            defer tarFile.Close()
            tr := tar.NewReader(tarFile)
            for {
                hdr, err := tr.Next()
                if err == io.EOF {
                    break // End of archive
                }
                if err != nil {
                    return err
                }
                fmt.Printf("Contents of %s: ", hdr.Name)
                // tr reads only up to the end of the current entry
                if _, err := io.Copy(os.Stdout, tr); err != nil {
                    return err
                }
                fmt.Println()
            }
            return nil
        }
        // ...
        if err := tarUnwrite(); err != nil {
            log.Fatal(err)
        }
    }

In the previous snippet, variable tr, of type *tar.Reader, is used to extract the files from archive file tarFile. Using a forever-loop, the code visits each archive segment in order and reconstructs it by printing its content to standard output. The first step is to call tr.Next() to get the segment’s header and check that the archive is not at EOF. If it is not, the code reads the content of the segment (using io.Copy) and prints it.
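
The same loop works on any io.Reader, not just a file. As a sketch (the helper names readEntries and roundTrip are hypothetical), the following collects the entries of an in-memory archive into a map instead of printing them:

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
	"log"
)

// readEntries returns the content of every regular file in the tar
// stream r, keyed by entry name.
func readEntries(r io.Reader) (map[string]string, error) {
	out := make(map[string]string)
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return out, nil // end of archive
		}
		if err != nil {
			return nil, err
		}
		if hdr.Typeflag != tar.TypeReg {
			continue // skip directories, links, and other entry types
		}
		// tr reports io.EOF at the end of the current entry, so io.Copy
		// stops at the entry boundary rather than at the end of the archive.
		var content bytes.Buffer
		if _, err := io.Copy(&content, tr); err != nil {
			return nil, err
		}
		out[hdr.Name] = content.String()
	}
}

// roundTrip packs files into an in-memory archive and reads them back.
func roundTrip(files map[string]string) (map[string]string, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	for name, content := range files {
		hdr := &tar.Header{Typeflag: tar.TypeReg, Name: name, Mode: 0600, Size: int64(len(content))}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write([]byte(content)); err != nil {
			return nil, err
		}
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return readEntries(&buf)
}

func main() {
	got, err := roundTrip(map[string]string{"index.html": "<body>Hello!</body>"})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(got["index.html"]) // prints "<body>Hello!</body>"
}
```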

While these examples are functionally complete, they are not the best way to use the package. The next sections introduce a couple of functions that handle tar files more robustly.

Tar from Files

After reading the previous section, readers should be familiar with the pieces necessary to create tar-encoded archives and extract files from them programmatically. This section explores the more common use case of building tar files from file sources.

Function tartar, in the following snippet, creates a tar file from a list of specified paths. It uses function filepath.Walk and type filepath.WalkFunc (from package path/filepath) to walk each specified file tree:

    import (
        "archive/tar"
        "fmt"
        "io"
        "os"
        "path/filepath"
    )

    func tartar(tarName string, paths []string) error {
        tarFile, err := os.Create(tarName)
        if err != nil {
            return err
        }
        defer tarFile.Close()
        tw := tar.NewWriter(tarFile)
        defer tw.Close()
        for _, path := range paths {
            walker := func(f string, fi os.FileInfo, err error) error {
                // ...
                // fill in header info using func FileInfoHeader
                // (the second argument is only used as the link target of symlinks)
                hdr, err := tar.FileInfoHeader(fi, fi.Name())
                // ...
                // calculate relative file path
                relFilePath := f
                if filepath.IsAbs(path) {
                    relFilePath, err = filepath.Rel(path, f)
                    if err != nil {
                        return err
                    }
                }
                hdr.Name = relFilePath
                if err := tw.WriteHeader(hdr); err != nil {
                    return err
                }
                // if path is a dir, go to next segment
                if fi.Mode().IsDir() {
                    return nil
                }
                // add file to tar
                srcFile, err := os.Open(f)
                // ...
                defer srcFile.Close() // runs when walker returns for this file
                _, err = io.Copy(tw, srcFile)
                if err != nil {
                    return err
                }
                return nil
            }
            if err := filepath.Walk(path, walker); err != nil {
                fmt.Printf("failed to add %s to tar: %s\n", path, err)
            }
        }
        return nil
    }

Full source github.com/vladimirvivien/go-tar/tartar/tartar.go

For the most part, this follows the same approach as before, where all of the work is done inside the walker function block. Here, however, instead of creating the tar header manually, function tar.FileInfoHeader is used to populate the header from the os.FileInfo value fi.

Note that when a directory is encountered, the code simply writes the header and moves on to the next file without writing any content. This creates a directory entry in the archive, which preserves the tree structure in the tar file.

When this code creates a tar file, listing it (for example with tar -tvf) shows that the file header information was added properly, including the time/date, ownership, and file mode.
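
To see what tar.FileInfoHeader produces for a directory versus a regular file, here is a small sketch. It is an illustration rather than the post's code: the helper names headerFor and demo are hypothetical, and headerFor uses os.Lstat and os.Readlink so that symlinks get their real target as the second argument.

```go
package main

import (
	"archive/tar"
	"fmt"
	"log"
	"os"
	"path/filepath"
)

// headerFor builds a tar header for path. For symlinks it passes the
// real link target to tar.FileInfoHeader; for other files the second
// argument is not used, so an empty string is enough.
func headerFor(path string) (*tar.Header, error) {
	fi, err := os.Lstat(path)
	if err != nil {
		return nil, err
	}
	link := ""
	if fi.Mode()&os.ModeSymlink != 0 {
		if link, err = os.Readlink(path); err != nil {
			return nil, err
		}
	}
	return tar.FileInfoHeader(fi, link)
}

// demo creates a temporary directory holding one 5-byte file and
// returns the type flags of both headers plus the file's recorded size.
func demo() (dirFlag, fileFlag byte, fileSize int64, err error) {
	dir, err := os.MkdirTemp("", "tarhdr")
	if err != nil {
		return 0, 0, 0, err
	}
	defer os.RemoveAll(dir)

	file := filepath.Join(dir, "note.txt")
	if err := os.WriteFile(file, []byte("hello"), 0600); err != nil {
		return 0, 0, 0, err
	}
	dh, err := headerFor(dir)
	if err != nil {
		return 0, 0, 0, err
	}
	fh, err := headerFor(file)
	if err != nil {
		return 0, 0, 0, err
	}
	return dh.Typeflag, fh.Typeflag, fh.Size, nil
}

func main() {
	d, f, n, err := demo()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("dir flag=%c (TypeDir=%c), file flag=%c (TypeReg=%c), file size=%d\n",
		d, tar.TypeDir, f, tar.TypeReg, n)
}
```

The directory header carries the directory type flag and no content size, which is why the walker can write it and return without copying any data.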

Next, let us look at how the content of the archive can be extracted into a file tree on the filesystem programmatically. The following code uses function untartar to extract and reconstruct files from tar file tarName into path xpath:

    func untartar(tarName, xpath string) (err error) {
        tarFile, err := os.Open(tarName)
        // ...
        defer tarFile.Close()
        absPath, err := filepath.Abs(xpath)
        // ...
        tr := tar.NewReader(tarFile)
        // untar each segment
        for {
            hdr, err := tr.Next()
            if err == io.EOF {
                break
            }
            if err != nil {
                return err
            }
            // determine proper file path info
            finfo := hdr.FileInfo()
            fileName := hdr.Name
            absFileName := filepath.Join(absPath, fileName)
            // if a dir, create it, then go to next segment
            if finfo.Mode().IsDir() {
                if err := os.MkdirAll(absFileName, 0755); err != nil {
                    return err
                }
                continue
            }
            // create new file with original file mode
            file, err := os.OpenFile(
                absFileName,
                os.O_RDWR|os.O_CREATE|os.O_TRUNC,
                finfo.Mode().Perm(),
            )
            if err != nil {
                return err
            }
            fmt.Printf("x %s\n", absFileName)
            n, cpErr := io.Copy(file, tr)
            // close explicitly (not deferred) so files are not held open across iterations
            if closeErr := file.Close(); closeErr != nil {
                return closeErr
            }
            if cpErr != nil {
                return cpErr
            }
            if n != finfo.Size() {
                return fmt.Errorf("wrote %d, want %d", n, finfo.Size())
            }
        }
        return nil
    }

Again, the extraction mechanism is similar to how it was done previously. Within a forever loop, method tr.Next is used to access the next header in the archive file. If the header is for a directory, the code creates the directory and moves on to the next header.

Recall that in the tartar function, hdr.Name is forced to be a relative path. This ensures that the file is placed in the proper subdirectory when it is extracted.

If the header is for a file, the file is created using os.OpenFile. This ensures that the file is created with the proper permission value. Finally, the code uses function io.Copy to transfer content from the archive into the newly created file.
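
The extraction code joins hdr.Name onto the target directory as-is, which is fine for archives produced by tartar. An archive from an untrusted source, however, could contain names such as ../../etc/passwd that resolve outside the target directory. The following is a minimal sketch of a guard that could run before os.OpenFile or os.MkdirAll; the helper name safeJoin is hypothetical and not part of the post's repository.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// safeJoin joins name onto base and returns an error if the result
// would fall outside base (for example because name contains "..").
func safeJoin(base, name string) (string, error) {
	target := filepath.Join(base, name) // Join also cleans the path
	rel, err := filepath.Rel(base, target)
	if err != nil {
		return "", err
	}
	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("entry %q escapes target directory %q", name, base)
	}
	return target, nil
}

func main() {
	for _, name := range []string{"docs/readme.txt", "../../etc/passwd"} {
		p, err := safeJoin("out", name)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("ok:", p)
	}
}
```

In untartar, the result of safeJoin(absPath, hdr.Name) would take the place of absFileName, and a returned error would abort the extraction.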

Adding Compression

The compress package offers several compression formats (including gzip, bzip2, lzw, etc.; note that the standard library's bzip2 package only implements decompression) that can easily be incorporated in your code. Again, since both the archive/tar and compress/gzip packages are implemented using Go’s streaming IO interfaces, it is trivial to change the code to compress the content of the archive file using gzip.

The following snippet updates function tartar to use the gzip compression when the archive file ends in .gz:

    import (
        "archive/tar"
        "compress/gzip"
        "os"
        "strings"
    )

    func tartar(tarName string, paths []string) (err error) {
        tarFile, err := os.Create(tarName)
        // ...
        // enable compression if file ends in .gz
        tw := tar.NewWriter(tarFile)
        if strings.HasSuffix(tarName, ".gz") {
            gz := gzip.NewWriter(tarFile)
            defer gz.Close()
            tw = tar.NewWriter(gz)
        }
        // deferred calls run last-in first-out, so tw is closed before gz
        defer tw.Close()
        // ...
    }

The previous code update is all that is necessary to compress the content being added to the archive. The io.Writer instances tw and gz are chained with tarFile, so bytes destined for tarFile are compressed as they are pipelined through gz. Pretty sweet!
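
The order in which the chained writers are closed matters: the tar writer must finish the archive before the gzip writer finalizes the compressed stream. The sketch below makes that order explicit and checks both Close errors, using an in-memory buffer in place of tarFile. The helper name tarGz is hypothetical.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"fmt"
	"log"
)

// tarGz packs a single entry into a gzip-compressed tar archive held
// in memory and returns the compressed bytes.
func tarGz(name, content string) ([]byte, error) {
	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf) // compresses into buf
	tw := tar.NewWriter(gz)    // writes tar data into gz

	hdr := &tar.Header{Name: name, Mode: 0600, Size: int64(len(content))}
	if err := tw.WriteHeader(hdr); err != nil {
		return nil, err
	}
	if _, err := tw.Write([]byte(content)); err != nil {
		return nil, err
	}
	// Close innermost first: the end of the tar archive must pass
	// through gz before gz writes its own trailer.
	if err := tw.Close(); err != nil {
		return nil, err
	}
	if err := gz.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := tarGz("hello.txt", "hello")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%d bytes, first two bytes: % x\n", len(data), data[:2])
}
```

The first two bytes of the result are the gzip magic number 1f 8b, which is what tools such as gzip use to recognize the format.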

Files compressed using tartar can be inspected using the gzip command:

    > gzip -l tartar.tar.gz
    compressed  uncompressed  ratio  uncompressed_name
        724385       6213632  88.3%  tartar.tar

Programmatically, the code can decompress tar-encoded content while unpacking files from the archive. The following code snippet updates function untartar to chain io.Readers tarFile, gz, and tr:

    func untartar(tarName, xpath string) (err error) {
        tarFile, err := os.Open(tarName)
        // ...
        tr := tar.NewReader(tarFile)
        // decompress on the fly if file ends in .gz
        if strings.HasSuffix(tarName, ".gz") {
            gz, err := gzip.NewReader(tarFile)
            if err != nil {
                return err
            }
            defer gz.Close()
            tr = tar.NewReader(gz)
        }
        // ...
    }

With this change, the program will automatically decompress the content of the archived files using gzip. The same chaining strategy can be used to support other compression algorithms that implement the streaming IO API.
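
Relying on the .gz suffix is a convention of this post's code. An alternative, sketched here as an assumption rather than as the author's approach, is to peek at the first two bytes of the stream and enable gzip when they match the gzip magic number 0x1f 0x8b. The helper names newTarReader, sample, and firstName are hypothetical.

```go
package main

import (
	"archive/tar"
	"bufio"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"log"
)

// newTarReader returns a tar reader over r, inserting a gzip reader
// into the chain when the stream starts with the gzip magic bytes.
// The gzip reader is not closed here; closing it would not close r.
func newTarReader(r io.Reader) (*tar.Reader, error) {
	br := bufio.NewReader(r)
	magic, err := br.Peek(2) // look ahead without consuming
	if err != nil {
		return nil, err
	}
	if magic[0] == 0x1f && magic[1] == 0x8b {
		gz, err := gzip.NewReader(br)
		if err != nil {
			return nil, err
		}
		return tar.NewReader(gz), nil
	}
	return tar.NewReader(br), nil
}

// sample builds a one-entry archive, gzip-compressed when compress is true.
func sample(compress bool) ([]byte, error) {
	var buf bytes.Buffer
	var w io.Writer = &buf
	var gz *gzip.Writer
	if compress {
		gz = gzip.NewWriter(&buf)
		w = gz
	}
	tw := tar.NewWriter(w)
	body := "hello"
	hdr := &tar.Header{Name: "hello.txt", Mode: 0600, Size: int64(len(body))}
	if err := tw.WriteHeader(hdr); err != nil {
		return nil, err
	}
	if _, err := tw.Write([]byte(body)); err != nil {
		return nil, err
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	if gz != nil {
		if err := gz.Close(); err != nil {
			return nil, err
		}
	}
	return buf.Bytes(), nil
}

// firstName returns the name of the first entry found in data.
func firstName(data []byte) (string, error) {
	tr, err := newTarReader(bytes.NewReader(data))
	if err != nil {
		return "", err
	}
	hdr, err := tr.Next()
	if err != nil {
		return "", err
	}
	return hdr.Name, nil
}

func main() {
	for _, compress := range []bool{false, true} {
		data, err := sample(compress)
		if err != nil {
			log.Fatal(err)
		}
		name, err := firstName(data)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("gzip=%v first entry=%s\n", compress, name)
	}
}
```

The same peek-and-wrap pattern could dispatch to other decompressors, as long as each one exposes an io.Reader that can be handed to tar.NewReader.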

Conclusion

The archive and compress packages in Go demonstrate how a powerful standard library can help programmers build serious tools. Both packages use Go’s streaming IO constructs to work with compressed tar-encoded files. Astute or curious readers are encouraged to update the code to use other archive formats or compression algorithms.
