The Deduplicating Warp-speed Advanced Read-only File System.
A fast, high-compression read-only file system for Linux, Windows, and macOS.
DwarFS is a read-only file system with a focus on achieving very high compression ratios in particular for very redundant data.
This probably doesn't sound very exciting, because if it's redundant, it should compress well. However, I found that other read-only, compressed file systems don't do a very good job at making use of this redundancy. See here for a comparison with other compressed file systems.
DwarFS also doesn't compromise on speed and for my use cases I've found it to be on par with or perform better than SquashFS. For my primary use case, DwarFS compression is an order of magnitude better than SquashFS compression, it's 6 times faster to build the file system, it's typically faster to access files on DwarFS and it uses less CPU resources.
To give you an idea of what DwarFS is capable of, here's a quick comparison of DwarFS and SquashFS on a set of video files with a total size of 39 GiB. The twist is that each unique video file has two sibling files with a different set of audio streams (this is an actual use case). So there's redundancy in both the video and audio data, but as the streams are interleaved and identical blocks are typically very far apart, it's challenging to make use of that redundancy for compression. SquashFS essentially fails to compress the source data at all, whereas DwarFS is able to reduce the size by almost a factor of 3, which is close to the theoretical maximum:
$ du -hs dwarfs-video-test
39G dwarfs-video-test
$ ls -lh dwarfs-video-test.*fs
-rw-r--r-- 1 mhx users 14G Jul 2 13:01 dwarfs-video-test.dwarfs
-rw-r--r-- 1 mhx users 39G Jul 12 09:41 dwarfs-video-test.squashfs
Furthermore, when mounting the SquashFS image and performing a random-read throughput test using fio-3.34, both squashfuse and squashfuse_ll top out at around 230 MiB/s:
$ fio --readonly --rw=randread --name=randread --bs=64k --direct=1 \
      --opendir=mnt --numjobs=4 --ioengine=libaio --iodepth=32 \
      --group_reporting --runtime=60 --time_based
[...]
READ: bw=230MiB/s (241MB/s), 230MiB/s-230MiB/s (241MB/s-241MB/s), io=13.5GiB (14.5GB), run=60004-60004msec
In comparison, DwarFS manages to sustain random read rates of 20 GiB/s:
READ: bw=20.2GiB/s (21.7GB/s), 20.2GiB/s-20.2GiB/s (21.7GB/s-21.7GB/s), io=1212GiB (1301GB), run=60001-60001msec
Distinct features of DwarFS are:
Clustering of files by similarity using a similarity hash function. This makes it easier to exploit the redundancy across file boundaries.
Segmentation analysis across file system blocks in order to reduce the size of the uncompressed file system. This saves memory when using the compressed file system and thus potentially allows for higher cache hit rates as more data can be kept in the cache.
Categorization framework to categorize files or even fragments of files and then process individual categories differently. For example, this allows you to not waste time trying to compress incompressible files or to compress PCM audio data using FLAC compression.
Highly multi-threaded implementation. Both the file system creation tool as well as the FUSE driver are able to make good use of the many cores of your system.
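The tools mentioned above combine into a simple workflow; here's a minimal sketch (all paths are made up, and the unmount command assumes a standard FUSE 3 setup):

```shell
# Create an image from a directory tree; --categorize enables the
# categorization framework so that e.g. PCM audio can be stored as FLAC:
mkdwarfs -i /path/to/tree -o tree.dwarfs --categorize

# Mount the image read-only via the FUSE driver:
mkdir -p /tmp/tree
dwarfs tree.dwarfs /tmp/tree

# ... access files under /tmp/tree ...

# Unmount using the standard FUSE 3 tooling:
fusermount3 -u /tmp/tree
```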
I started working on DwarFS in 2013. My main use case and major motivation was that I had several hundred different versions of Perl that were taking up around 30 gigabytes of disk space, and I was unwilling to spend more than 10% of my hard drive keeping them around for when I happened to need them.
Up until then, I had been using Cromfs for squeezing them into a manageable size. However, I was getting more and more annoyed by the time it took to build the filesystem image and, to make things worse, more often than not it was crashing after about an hour or so.
I had obviously also looked into SquashFS, but never got anywhere close to the compression rates of Cromfs.
This alone wouldn't have been enough to get me into writing DwarFS, but at around the same time, I was pretty obsessed with the recent developments and features of newer C++ standards and really wanted a C++ hobby project to work on. Also, I've wanted to do something with FUSE for quite some time. Last but not least, I had been thinking about the problem of compressed file systems for a bit and had some ideas that I definitely wanted to try.
The majority of the code was written in 2013, then I did a couple of cleanups, bugfixes and refactors every once in a while, but I never really got it to a state where I would feel happy releasing it. It was too awkward to build with its dependency on Facebook's (quite awesome) folly library and it didn't have any documentation.
Digging out the project again this year, things didn't look as grim as they used to. Folly now builds with CMake and so I just pulled it in as a submodule. Most other dependencies can be satisfied from packages that should be widely available. And I've written some rudimentary docs as well.
DwarFS should usually build fine out of the box with minimal changes. If it doesn't, please file an issue. I've set up CI jobs using Docker images for Ubuntu (22.04 and 24.04), Fedora Rawhide and Arch that can help with determining an up-to-date set of dependencies.
Note that building from the release tarball requires fewer dependencies than building from the git repository; notably, the ronn tool as well as Python and the mistletoe Python module are not required when building from the release tarball.
There are some things to be aware of:
There's a tendency to try and unbundle the folly and fbthrift libraries that are included as submodules and built along with DwarFS. While I agree with the sentiment, it's unfortunately a bad idea. Besides the fact that folly does not make any claims about ABI stability (i.e. you can't just dynamically link a binary built against one version of folly against another version), it's not even possible to safely link against a folly library built with different compile options. Even subtle differences, such as the C++ standard version, can cause run-time errors. See this issue for details. Currently, it is not even possible to use external versions of folly/fbthrift, as DwarFS builds minimal subsets of both libraries; these are bundled in the dwarfs_common library and strictly used internally, i.e. none of the folly or fbthrift headers are required to build against DwarFS' libraries.
Similar issues can arise when using a system-installed version of GoogleTest. GoogleTest itself recommends being downloaded as part of the build. However, you can use the system-installed version by passing -DPREFER_SYSTEM_GTEST=ON to the cmake call. Use at your own risk.
For other bundled libraries (namely fmt, parallel-hashmap, range-v3), the system-installed version is used as long as it meets the minimum required version. Otherwise, the preferred version is fetched during the build.
Each release has pre-built, statically linked binaries for Linux-x86_64, Linux-aarch64 and Windows-AMD64 available for download. These should run without any dependencies and can be useful especially on older distributions where you can't easily build the tools from source.
In addition to the binary tarballs, there's a universal binary available for each architecture. These universal binaries contain all tools (mkdwarfs, dwarfsck, dwarfsextract and the dwarfs FUSE driver) in a single executable. These executables are compressed using upx, so they are much smaller than the individual tools combined. However, it also means the binaries need to be decompressed each time they are run, which can add significant overhead. If that is an issue, you can either stick to the "classic" individual binaries or you can decompress the universal binary, e.g.:
upx -d dwarfs-universal-0.7.0-Linux-aarch64
The universal binaries can be run through symbolic links named after the proper tool, e.g.:
$ ln -s dwarfs-universal-0.7.0-Linux-aarch64 mkdwarfs
$ ./mkdwarfs --help
This also works on Windows if the file system supports symbolic links:
> mklink mkdwarfs.exe dwarfs-universal-0.7.0-Windows-AMD64.exe
> .\mkdwarfs.exe --help
Alternatively, you can select the tool by passing --tool=<name> as the first argument on the command line:
> .\dwarfs-universal-0.7.0-Windows-AMD64.exe --tool=mkdwarfs --help
Note that just like the dwarfs.exe Windows binary, the universal Windows binary depends on the winfsp-x64.dll from the WinFsp project. However, for the universal binary, the DLL is loaded lazily, so you can still use all other tools without the DLL.
See the Windows Support section for more details.
DwarFS uses CMake as a build tool.
It uses both Boost and Folly, though the latter is included as a submodule since very few distributions actually offer packages for it. Folly itself has a number of dependencies, so please check here for an up-to-date list.
It also uses Facebook Thrift, in particular the frozen library, for storing metadata in a highly space-efficient, memory-mappable and well-defined format. It's also included as a submodule, and we only build the compiler and a very reduced library that contains just enough for DwarFS to work.
Other than that, DwarFS really only depends on FUSE3 and on a set of compression libraries that Folly already depends on (namely lz4, zstd and liblzma).
The dependency on googletest will be automatically resolved if you build with tests.
A good starting point for apt-based systems is probably:
$ apt install \
    gcc \
    g++ \
    clang \
    git \
    ccache \
    ninja-build \
    cmake \
    make \
    bison \
    flex \
    fuse3 \
    pkg-config \
    binutils-dev \
    libacl1-dev \
    libarchive-dev \
    libbenchmark-dev \
    libboost-chrono-dev \
    libboost-context-dev \
    libboost-filesystem-dev \
    libboost-iostreams-dev \
    libboost-program-options-dev \
    libboost-regex-dev \
    libboost-system-dev \
    libboost-thread-dev \
    libbrotli-dev \
    libevent-dev \
    libhowardhinnant-date-dev \
    libjemalloc-dev \
    libdouble-conversion-dev \
    libiberty-dev \
    liblz4-dev \
    liblzma-dev \
    libzstd-dev \
    libxxhash-dev \
    libmagic-dev \
    libparallel-hashmap-dev \
    librange-v3-dev \
    libssl-dev \
    libunwind-dev \
    libdwarf-dev \
    libelf-dev \
    libfmt-dev \
    libfuse3-dev \
    libgoogle-glog-dev \
    libutfcpp-dev \
    libflac++-dev \
    nlohmann-json3-dev
Note that when building with gcc, the optimization level will be set to -O2 instead of the CMake default of -O3 for release builds. At least with versions up to gcc-10, the -O3 build is up to 70% slower than a build with -O2.
First, unpack the release archive:
$ tar xvf dwarfs-x.y.z.tar.xz
$ cd dwarfs-x.y.z
Alternatively, you can also clone the git repository, but be aware that this has more dependencies and the build will likely take longer, because the release archive ships with most of the auto-generated files that would otherwise have to be generated when building from the repository:
$ git clone --recurse-submodules https://github.com/mhx/dwarfs
$ cd dwarfs
Once all dependencies have been installed, you can build DwarFS using:
$ mkdir build
$ cd build
$ cmake .. -GNinja -DWITH_TESTS=ON
$ ninja
You can then run tests with:
$ ctest -j
All binaries use jemalloc as a memory allocator by default, as it typically uses much less system memory compared to the glibc or tcmalloc allocators. To disable the use of jemalloc, pass -DUSE_JEMALLOC=0 on the cmake command line.
It is also possible to build/install the DwarFS libraries, tools, and FUSE driver independently. This is mostly interesting when packaging DwarFS. Note that the tools and FUSE driver require the libraries to be either built or already installed. To build just the libraries, use:
$ cmake .. -GNinja -DWITH_TESTS=ON -DWITH_LIBDWARFS=ON -DWITH_TOOLS=OFF -DWITH_FUSE_DRIVER=OFF
Once the libraries are tested and installed, you can build the tools (i.e. mkdwarfs, dwarfsck, dwarfsextract) using:
$ cmake .. -GNinja -DWITH_TESTS=ON -DWITH_LIBDWARFS=OFF -DWITH_TOOLS=ON -DWITH_FUSE_DRIVER=OFF
To build the FUSE driver, use:
$ cmake .. -GNinja -DWITH_TESTS=ON -DWITH_LIBDWARFS=OFF -DWITH_TOOLS=OFF -DWITH_FUSE_DRIVER=ON
Installing is as easy as:
$ sudo ninja install
Though you don't have to install the tools to play with them.
Attempting to build statically linked binaries is highly discouraged and not officially supported. That being said, here's how to set up an environment where you might be able to build static binaries.
This has been tested with ubuntu-22.04-live-server-amd64.iso. First, install all the packages listed as dependencies above. Also install:
$ apt install ccache ninja libacl1-dev
ccache and ninja are optional, but help with a speedy compile.
Depending on your distribution, you'll need to build and install static versions of some libraries, e.g. libarchive and libmagic for Ubuntu:
$ wget https://github.com/libarchive/libarchive/releases/download/v3.6.2/libarchive-3.6.2.tar.xz
$ tar xf libarchive-3.6.2.tar.xz && cd libarchive-3.6.2
$ ./configure --prefix=/opt/static-libs --without-iconv --without-xml2 --without-expat
$ make && sudo make install
$ wget ftp://ftp.astron.com/pub/file/file-5.44.tar.gz
$ tar xf file-5.44.tar.gz && cd file-5.44
$ ./configure --prefix=/opt/static-libs --enable-static=yes --enable-shared=no
$ make && sudo make install
That's it! Now you can try building static binaries for DwarFS:
$ git clone --recurse-submodules https://github.com/mhx/dwarfs
$ cd dwarfs && mkdir build && cd build
$ cmake .. -GNinja -DWITH_TESTS=ON -DSTATIC_BUILD_DO_NOT_USE=ON \
        -DSTATIC_BUILD_EXTRA_PREFIX=/opt/static-libs
$ ninja
$ ninja test
Please check out the manual pages for mkdwarfs, dwarfs, dwarfsck and dwarfsextract. You can also access the manual pages using the --man option to each binary, e.g.:
$ mkdwarfs --man
The dwarfs manual page also shows an example for setting up DwarFS with overlayfs in order to create a writable file system mount on top of a read-only DwarFS image.
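As a rough sketch of such an overlayfs setup (the mount options are the standard Linux overlayfs parameters; all paths here are made up, and the dwarfs manual page remains the authoritative reference):

```shell
# Mount the read-only DwarFS image:
mkdir -p /tmp/ro /tmp/rw/upper /tmp/rw/work /tmp/merged
dwarfs image.dwarfs /tmp/ro

# Stack a writable directory on top of it using overlayfs; writes go
# to upperdir, while unchanged files are read from the DwarFS mount:
sudo mount -t overlay overlay \
  -o lowerdir=/tmp/ro,upperdir=/tmp/rw/upper,workdir=/tmp/rw/work \
  /tmp/merged
```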
A description of the DwarFS filesystem format can be found in dwarfs-format.
A high-level overview of the internal operation of mkdwarfs is shown in this sequence diagram.
Using the DwarFS libraries should be pretty straightforward if you're using CMake to build your project. For a quick start, have a look at the example code that uses the libraries to print information about a DwarFS image (like dwarfsck) or extract it (like dwarfsextract).
There are five individual libraries:
dwarfs_common contains the common code required by all the other libraries. The interfaces are defined in dwarfs/.
dwarfs_reader contains all code required to read data from a DwarFS image. The interfaces are defined in dwarfs/reader/.
dwarfs_extractor contains the code required to extract a DwarFS image using libarchive. The interfaces are defined in dwarfs/utility/filesystem_extractor.h.
dwarfs_writer contains the code required to create DwarFS images. The interfaces are defined in dwarfs/writer/.
dwarfs_rewrite contains the code to re-write DwarFS images. The interfaces are defined in dwarfs/utility/rewrite_filesystem.h.
The headers in internal subfolders are only accessible at build time and won't be installed. The same goes for the tool subfolder.
The reader and extractor APIs should be fairly stable. The writer APIs are likely going to change. Note, however, that there are no guarantees on API stability before this project reaches version 1.0.0.
Support for the Windows operating system is currently experimental. Having worked pretty much exclusively in a Unix world for the past two decades, my experience with Windows development is rather limited and I'd expect there to definitely be bugs and rough edges in the Windows code.
The Windows version of the DwarFS filesystem driver relies on the awesome WinFsp project, and its winfsp-x64.dll must be discoverable by the dwarfs.exe driver.
The different tools should behave pretty much the same whether you're using them on Linux or Windows. The file system images can be copied between Linux and Windows and images created on one OS should work fine on the other.
There are a few things worth pointing out, though:
DwarFS supports both hardlinks and symlinks on Windows, just as it does on Linux. However, creating hardlinks and symlinks seems to require admin privileges on Windows, so if you want to e.g. extract a DwarFS image that contains links of some sort, you might run into errors if you don't have the right privileges.
Due to a problem in WinFsp, symlinks cannot currently point outside of the mounted file system. Furthermore, due to another problem in WinFsp, symlinks with a drive letter will appear with a mangled target path.
The DwarFS driver on Windows correctly reports hardlink counts via its API, but currently these counts are not correctly propagated to the Windows file system layer. This is presumably due to a problem in WinFsp.
When mounting a DwarFS image on Windows, the mount point must not exist. This is different from Linux, where the mount point must actually exist. Also, it's possible to mount a DwarFS image as a drive letter, e.g.
dwarfs.exe image.dwarfs Z:
Filter rules for mkdwarfs always require Unix path separators, regardless of whether it's running on Windows or Linux.
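For illustration, a filter rule on the mkdwarfs command line might look like this (the -F option and its rsync-style rule syntax are described in the mkdwarfs manual page; treat the exact rules below as an assumption):

```shell
# Exclude anything under .git/ and all object files. Note the forward
# slashes in the rules, even when the input path is a Windows path:
mkdwarfs -i C:\source\tree -o tree.dwarfs -F '- .git/' -F '- *.o'
```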
Building on Windows is not too complicated thanks to vcpkg. You'll need to install:
Visual Studio and the MSVC C/C++ compiler
Git
CMake
Ninja
WinFsp
WinFsp is expected to be installed in C:\Program Files (x86)\WinFsp; if it's not, you'll need to set WINFSP_PATH when running CMake via cmake/win.bat.
Now you need to clone vcpkg and dwarfs:
> cd %HOMEPATH%
> mkdir git
> cd git
> git clone https://github.com/Microsoft/vcpkg.git
> git clone https://github.com/mhx/dwarfs
Then, bootstrap vcpkg:
> .\vcpkg\bootstrap-vcpkg.bat
And build DwarFS:
> cd dwarfs
> mkdir build
> cd build
> ..\cmake\win.bat
> ninja
Once that's done, you should be able to run the tests. Set CTEST_PARALLEL_LEVEL according to the number of CPU cores in your machine.
> set CTEST_PARALLEL_LEVEL=10
> ninja test
The DwarFS libraries and tools (mkdwarfs, dwarfsck, dwarfsextract) are now available from Homebrew:
$ brew install dwarfs
$ brew test dwarfs
The macOS version of the DwarFS filesystem driver relies on the awesome macFUSE project. Until a formula has been added, you will have to build the DwarFS FUSE driver manually.
Building on macOS should be relatively straightforward:
Install Homebrew
Use Homebrew to install the necessary dependencies:
$ brew install cmake ninja macfuse brotli howard-hinnant-date double-conversion \
      fmt glog libarchive libevent flac openssl nlohmann-json pkg-config \
      range-v3 utf8cpp xxhash boost zstd
When installing macFUSE for the first time, you'll need to explicitly allow the software in System Preferences / Privacy & Security. It's quite likely that you'll have to reboot after this.
Download a release tarball from the releases page and extract it:
$ wget https://github.com/mhx/dwarfs/releases/download/v0.10.0/dwarfs-0.10.0.tar.xz
$ tar xf dwarfs-0.10.0.tar.xz
$ cmake --fresh -B dwarfs-build -S dwarfs-0.10.0 -GNinja -DWITH_TESTS=ON
$ cmake --build dwarfs-build
$ ctest --test-dir dwarfs-build -j
If you don't want to build the FUSE driver, you can omit macfuse from the brew install and use the following instead of the first cmake command above:
$ cmake --fresh -B dwarfs-build -S dwarfs-0.10.0 -GNinja -DWITH_TESTS=ON -DWITH_FUSE_DRIVER=OFF
To build only the FUSE driver, use:
$ cmake --fresh -B dwarfs-build -S dwarfs-0.10.0 -GNinja -DWITH_TESTS=ON -DWITH_LIBDWARFS=OFF -DWITH_TOOLS=OFF
$ sudo cmake --install dwarfs-build
That's it!
Astrophotography can generate huge amounts of raw image data. During a single night, it's not unlikely to end up with a few dozen gigabytes of data. With most dedicated astrophotography cameras, this data ends up in the form of FITS images. These are usually uncompressed, don't compress very well with standard compression algorithms, and while there are certain compressed FITS formats, these aren't widely supported.
One of the compression formats (simply called "Rice") compresses reasonably well and is really fast. However, its implementation for compressed FITS has a few drawbacks. The most severe are that compression isn't quite as good as it could be for color sensors and for sensors with less than 16 bits of resolution.
DwarFS supports the ricepp (Rice++) compression, which builds on the basic idea of Rice compression, but makes a few enhancements: it compresses color and low bit depth images significantly better and always searches for the optimum solution during compression instead of relying on a heuristic.
Let's look at an example using 129 images (darks, flats and lights) taken with an ASI1600MM camera. Each image is 32 MiB, so a total of 4 GiB of data. Compressing these with the standard fpack tool takes about 16.6 seconds and yields a total output size of 2.2 GiB:
$ time fpack */*.fit */*/*.fit
user 14.992
system 1.592
total 16.616
$ find . -name '*.fz' -print0 | xargs -0 cat | wc -c
2369943360
However, this leaves you with *.fz files that not every application can actually read.
Using DwarFS, here's what we get:
$ mkdwarfs -i ASI1600 -o asi1600-20.dwarfs -S 20 --categorize
I 08:47:47.459077 scanning "ASI1600"
I 08:47:47.491492 assigning directory and link inodes...
I 08:47:47.491560 waiting for background scanners...
I 08:47:47.675241 scanning CPU time: 1.051s
I 08:47:47.675271 finalizing file inodes...
I 08:47:47.675330 saved 0 B / 3.941 GiB in 0/258 duplicate files
I 08:47:47.675360 assigning device inodes...
I 08:47:47.675371 assigning pipe/socket inodes...
I 08:47:47.675381 building metadata...
I 08:47:47.675393 building blocks...
I 08:47:47.675398 saving names and symlinks...
I 08:47:47.675514 updating name and link indices...
I 08:47:47.675796 waiting for segmenting/blockifying to finish...
I 08:47:50.274285 total ordering CPU time: 616.3us
I 08:47:50.274329 total segmenting CPU time: 1.132s
I 08:47:50.279476 saving chunks...
I 08:47:50.279622 saving directories...
I 08:47:50.279674 saving shared files table...
I 08:47:50.280745 saving names table... [1.047ms]
I 08:47:50.280768 saving symlinks table... [743ns]
I 08:47:50.282031 waiting for compression to finish...
I 08:47:50.823924 compressed 3.941 GiB to 1.201 GiB (ratio=0.304825)
I 08:47:50.824280 compression CPU time: 17.92s
I 08:47:50.824316 filesystem created without errors [3.366s]
⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯
waiting for block compression to finish
5 dirs, 0/0 soft/hard links, 258/258 files, 0 other
original size: 3.941 GiB, hashed: 315.4 KiB (18 files, 0 B/s)
scanned: 3.941 GiB (258 files, 117.1 GiB/s), categorizing: 0 B/s
saved by deduplication: 0 B (0 files), saved by segmenting: 0 B
filesystem: 3.941 GiB in 4037 blocks (4550 chunks, 516/516 fragments, 258 inodes)
compressed filesystem: 4037 blocks/1.201 GiB written
In less than 3.4 seconds, it compresses the data down to 1.2 GiB, almost half the size of the fpack output.
In addition to saving a lot of disk space, this can also be useful when your data is stored on a NAS. Here's a comparison of the same set of data accessed over a 1 Gb/s network connection, first using the uncompressed raw data:
$ find /mnt/ASI1600 -name '*.fit' -print0 | xargs -0 -P4 -n1 cat | dd of=/dev/null status=progress
4229012160 bytes (4.2 GB, 3.9 GiB) copied, 36.0455 s, 117 MB/s
And next using a DwarFS image on the same share:
$ dwarfs /mnt/asi1600-20.dwarfs mnt
$ find mnt -name '*.fit' -print0 | xargs -0 -P4 -n1 cat | dd of=/dev/null status=progress
4229012160 bytes (4.2 GB, 3.9 GiB) copied, 14.3681 s, 294 MB/s
That's roughly 2.5 times faster. You can very likely see similar results with slow external hard drives.
Currently, DwarFS has no built-in ability to add recovery information to a file system image. However, for archival purposes, it's a good idea to have such recovery information in order to be able to repair a damaged image.
This is fortunately relatively straightforward using something like par2cmdline:
$ par2create -n1 asi1600-20.dwarfs
This will create two additional files that you can place alongside the image (or on a different storage), as you'll only need them if DwarFS has detected an issue with the file system image. If there's an issue, you can run
$ par2repair asi1600-20.dwarfs
which will very likely be able to recover the image if less than 5% (the default used by par2create) of the image is damaged.
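If you just want to check the image without attempting a repair, you can combine DwarFS' own consistency check with par2's verify mode (par2verify is part of par2cmdline):

```shell
# Check the DwarFS image itself for errors:
dwarfsck asi1600-20.dwarfs

# Verify the image against the par2 recovery data without repairing:
par2verify asi1600-20.dwarfs
```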
Extended attributes are not currently supported. Any extended attributes stored in the source file system will not currently be preserved when building a DwarFS image using mkdwarfs.
That being said, the root inode of a mounted DwarFS image currently exposes one or two extended attributes on Linux:
$ attr -l mnt
Attribute "dwarfs.driver.pid" has a 4 byte value for mnt
Attribute "dwarfs.driver.perfmon" has a 4849 byte value for mnt
The dwarfs.driver.pid attribute simply contains the PID of the DwarFS FUSE driver. The dwarfs.driver.perfmon attribute contains the current results of the performance monitor.
Furthermore, each regular file exposes an attribute dwarfs.inodeinfo with information about the underlying inode:
$ attr -l "05 Disappear.caf"
Attribute "dwarfs.inodeinfo" has a 448 byte value for 05 Disappear.caf
The attribute contains a JSON object with information about the underlying inode:
$ attr -qg dwarfs.inodeinfo "05 Disappear.caf"
{
"chunks": [
{
"block": 2,
"category": "pcmaudio/metadata",
"offset": 270976,
"size": 4096
},
{
"block": 414,
"category": "pcmaudio/waveform",
"offset": 37594368,
"size": 29514492
},
{
"block": 419,
"category": "pcmaudio/waveform",
"offset": 0,
"size": 29385468
}
],
"gid": 100,
"mode": 33188,
"modestring": "----rw-r--r--",
"uid": 1000
}
This is useful, for example, to check how a particular file is spread across multiple blocks or which categories have been assigned to the file.
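Assuming jq is available, this JSON can be summarized directly from the shell, for example to sum up the bytes stored per category (file name taken from the example above):

```shell
# Group the chunks by category and add up their sizes:
attr -qg dwarfs.inodeinfo "05 Disappear.caf" |
  jq -r '.chunks | group_by(.category)[] |
         "\(.[0].category): \(map(.size) | add) bytes"'
```

For the example output above, this reports 4096 bytes of pcmaudio/metadata and the two pcmaudio/waveform chunks summed together.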
The SquashFS, xz, lrzip, zpaq and wimlib tests were all done on an 8 core Intel(R) Xeon(R) E-2286M CPU @ 2.40GHz with 64 GiB of RAM.
The Cromfs tests were done with an older version of DwarFS on a 6 core Intel(R) Xeon(R) CPU D-1528 @ 1.90GHz with 64 GiB of RAM.
The EROFS tests were done using DwarFS v0.9.8 and EROFS v1.7.1 on an Intel(R) Core(TM) i9-13900K with 64 GiB of RAM.
The systems were mostly idle during all of the tests.
The source directory contained 1139 different Perl installations from 284 distinct releases, a total of 47.65 GiB of data in 1,927,501 files and 330,733 directories. The source directory was freshly unpacked from a tar archive to an XFS partition on a 970 EVO Plus 2TB NVME drive, so most of its contents were likely cached.
I'm using the same compression type and compression level for SquashFS that is the default setting for DwarFS:
$ time mksquashfs install perl-install.squashfs -comp zstd -Xcompression-level 22
Parallel mksquashfs: Using 16 processors
Creating 4.0 filesystem on perl-install-zstd.squashfs, block size 131072.
[=========================================================/] 2107401/2107401 100%
Exportable Squashfs 4.0 filesystem, zstd compressed, data block size 131072
compressed data, compressed metadata, compressed fragments,
compressed xattrs, compressed ids
duplicates are removed
Filesystem size 4637597.63 Kbytes (4528.90 Mbytes)
9.29% of uncompressed filesystem size (49922299.04 Kbytes)
Inode table size 19100802 bytes (18653.13 Kbytes)
26.06% of uncompressed inode table size (73307702 bytes)
Directory table size 19128340 bytes (18680.02 Kbytes)
46.28% of uncompressed directory table size (41335540 bytes)
Number of duplicate files found 1780387
Number of inodes 2255794
Number of files 1925061
Number of fragments 28713
Number of symbolic links 0
Number of device nodes 0
Number of fifo nodes 0
Number of socket nodes 0
Number of directories 330733
Number of ids (unique uids + gids) 2
Number of uids 1
mhx (1000)
Number of gids 1
users (100)
real 32m54.713s
user 501m46.382s
sys 0m58.528s
For DwarFS, I'm sticking to the defaults:
$ time mkdwarfs -i install -o perl-install.dwarfs
I 11:33:33.310931 scanning install
I 11:33:39.026712 waiting for background scanners...
I 11:33:50.681305 assigning directory and link inodes...
I 11:33:50.888441 finding duplicate files...
I 11:34:01.120800 saved 28.2 GiB / 47.65 GiB in 1782826/1927501 duplicate files
I 11:34:01.122608 waiting for inode scanners...
I 11:34:12.839065 assigning device inodes...
I 11:34:12.875520 assigning pipe/socket inodes...
I 11:34:12.910431 building metadata...
I 11:34:12.910524 building blocks...
I 11:34:12.910594 saving names and links...
I 11:34:12.910691 bloom filter size: 32 KiB
I 11:34:12.910760 ordering 144675 inodes using nilsimsa similarity...
I 11:34:12.915555 nilsimsa: depth=20000 (1000), limit=255
I 11:34:13.052525 updating name and link indices...
I 11:34:13.276233 pre-sorted index (660176 name, 366179 path lookups) [360.6ms]
I 11:35:44.039375 144675 inodes ordered [91.13s]
I 11:35:44.041427 waiting for segmenting/blockifying to finish...
I 11:37:38.823902 bloom filter reject rate: 96.017% (TPR=0.244%, lookups=4740563665)
I 11:37:38.823963 segmentation matches: good=454708, bad=6819, total=464247
I 11:37:38.824005 segmentation collisions: L1=0.008%, L2=0.000% [2233254 hashes]
I 11:37:38.824038 saving chunks...
I 11:37:38.860939 saving directories...
I 11:37:41.318747 waiting for compression to finish...
I 11:38:56.046809 compressed 47.65 GiB to 430.9 MiB (ratio=0.00883101)
I 11:38:56.304922 filesystem created without errors [323s]
⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯
waiting for block compression to finish
330733 dirs, 0/2440 soft/hard links, 1927501/1927501 files, 0 other
original size: 47.65 GiB, dedupe: 28.2 GiB (1782826 files), segment: 15.19 GiB
filesystem: 4.261 GiB in 273 blocks (319178 chunks, 144675/144675 inodes)
compressed filesystem: 273 blocks/430.9 MiB written [depth: 20000]
█████████████████████████████████████████████████████████████████████████████▏100% |
real 5m23.030s
user 78m7.554s
sys 1m47.968s
So in this comparison, mkdwarfs is more than 6 times faster than mksquashfs, both in terms of CPU time and wall clock time.
$ ll perl-install.*fs
-rw-r--r-- 1 mhx users 447230618 Mar 3 20:28 perl-install.dwarfs
-rw-r--r-- 1 mhx users 4748902400 Mar 3 20:10 perl-install.squashfs
In terms of compression ratio, the DwarFS file system is more than 10 times smaller than the SquashFS file system. With DwarFS, the content has been compressed down to less than 0.9% (!) of its original size. This compression ratio only considers the data stored in the individual files, not the actual disk space used. On the original XFS file system, according to du, the source folder uses 52 GiB, so the DwarFS image actually only uses 0.8% of the original space.
Here's another comparison using lzma compression instead of zstd:
$ time mksquashfs install perl-install-lzma.squashfs -comp lzma
real 13m42.825s
user 205m40.851s
sys 3m29.088s
$ time mkdwarfs -i install -o perl-install-lzma.dwarfs -l9
real 3m43.937s
user 49m45.295s
sys 1m44.550s
$ ll perl-install-lzma.*fs
-rw-r--r-- 1 mhx users 315482627 Mar 3 21:23 perl-install-lzma.dwarfs
-rw-r--r-- 1 mhx users 3838406656 Mar 3 20:50 perl-install-lzma.squashfs
It's immediately obvious that the runs are significantly faster and the resulting images are significantly smaller. Still, mkdwarfs is about 4 times faster and produces an image that's 12 times smaller than the SquashFS image. The DwarFS image is only 0.6% of the original file size.
So, why not use lzma instead of zstd by default? The reason is that lzma is about an order of magnitude slower to decompress than zstd. If you're only accessing data on your compressed filesystem occasionally, this might not be a big deal, but if you use it extensively, zstd will result in better performance.
The comparisons above are not completely fair. mksquashfs by default uses a block size of 128KiB, whereas mkdwarfs uses 16MiB blocks by default, or even 64MiB blocks with -l9. When using identical block sizes for both file systems, the difference, quite expectedly, becomes a lot less dramatic:
$ time mksquashfs install perl-install-lzma-1M.squashfs -comp lzma -b 1M
real 15m43.319s
user 139m24.533s
sys 0m45.132s
$ time mkdwarfs -i install -o perl-install-lzma-1M.dwarfs -l9 -S20 -B3
real 4m25.973s
user 52m15.100s
sys 7m41.889s
$ ll perl-install*.*fs
-rw-r--r-- 1 mhx users 935953866 Mar 13 12:12 perl-install-lzma-1M.dwarfs
-rw-r--r-- 1 mhx users 3407474688 Mar 3 21:54 perl-install-lzma-1M.squashfs
Even this is still not entirely fair, as it uses a feature (-B3) that allows DwarFS to reference file chunks from up to two previous filesystem blocks.
But the point is that this is really where SquashFS tops out, as it doesn't support larger block sizes or back-referencing. And as you'll see below, the larger blocks that DwarFS is using by default don't necessarily negatively impact performance.
DwarFS also features an option to recompress an existing file system with a different compression algorithm. This can be useful as it allows relatively fast experimentation with different algorithms and options without requiring a full rebuild of the file system. For example, recompressing the above file system with the best possible compression (-l9):
$ time mkdwarfs --recompress -i perl-install.dwarfs -o perl-lzma-re.dwarfs -l9
I 20:28:03.246534 filesystem rewritten without errors [148.3s]
⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯
filesystem: 4.261 GiB in 273 blocks (0 chunks, 0 inodes)
compressed filesystem: 273/273 blocks/372.7 MiB written
████████████████████████████████████████████████████████████████████▏100%
real 2m28.279s
user 37m8.825s
sys 0m43.256s