
Understanding disk usage in Linux

How much space is this file taking from my hard drive? How much free space do I have? How many more files can I fit in the remaining free space?

The answer to these questions seems obvious. We all have an instinctive understanding of how filesystems work, and we often picture storing files in disk space in the same way as physically fitting apples inside a basket.

In modern Linux systems though, this intuition can be misleading. Let us see why.

File size


What is the size of a file? This one seems easy: the sum of all the bytes of content from the beginning of the file to its end.

We often picture all the file contents laid out one byte after another until the end of the file.

And that’s how we commonly think about file size. We can get it with ls -l, or with the stat command, which makes use of the stat() system call.
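For example (example.txt is a made-up file name; the output will differ on your system):

    ls -l example.txt                      # the fifth column is the file size in bytes
    stat --format='%s bytes' example.txt   # %s prints the same size in bytes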

Inside the Linux kernel, the memory structure that represents the file is the inode. The metadata that we can access through the stat command lives in the inode.

Among its fields we find some familiar attributes, such as the access and modification timestamps, and also i_size, which is the file size as we defined it earlier.
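We can inspect those inode fields from userspace with a plain stat call; again, the file name is just an example:

    stat example.txt
    # prints, among other things:
    #   Size   -> i_size, the file size in bytes
    #   Blocks -> the number of 512-byte blocks allocated to the file
    #   Inode  -> the inode number
    #   Access / Modify / Change -> the inode timestamps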

Thinking in terms of the file size is intuitive, but we are more interested in how space is actually used.

Blocks and block size


Regarding how the file is stored internally, the filesystem divides storage into blocks. Traditionally the block size was 512 bytes; more recently it is 4 kilobytes. The value is chosen based on the page size supported by typical MMU hardware.
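We can check the page size and the block size that a filesystem is actually using; the device name below is an assumption, substitute your own:

    getconf PAGE_SIZE                                # typically 4096
    stat -f --format='%S bytes per block' /          # fundamental block size of the root filesystem
    sudo tune2fs -l /dev/sda1 | grep 'Block size'    # ext2/3/4 only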

The filesystem inserts our chunked file into those blocks, and keeps track of them in the metadata.

Ideally, all the blocks of a file sit next to one another, but in practice files are constantly created, resized and destroyed, so their blocks end up scattered all over the disk.

This is known as external fragmentation, and it traditionally results in performance degradation, because the spinning head of the hard drive has to jump around gathering the fragments, which is a slow operation. Classic defragmentation tools try to keep this problem at bay.
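We can get an idea of how fragmented a file or an ext4 filesystem is with tools like these (the file and mount point names are examples):

    filefrag bigfile.iso       # reports how many extents (fragments) the file occupies
    sudo e4defrag -c /home     # computes a fragmentation score for an ext4 filesystem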

What happens with files smaller than 4 KiB? And what happens with the contents of the last block after we have cut our file into pieces? Naturally there is going to be wasted space there; we call that phenomenon internal fragmentation. This is an undesirable side effect that can render a lot of free space unusable, much more so when we have a large number of very small files.

We can see the real disk usage of a file with stat, with du, or with ls -s.

For example, the contents of a one-byte file still use 4 KiB of disk space.
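A quick way to see this (onebyte is a made-up name; the exact numbers depend on your block size):

    echo -n x > onebyte
    stat --format='size: %s byte(s)  allocated: %b blocks of %B bytes' onebyte
    du -h onebyte      # typically reports 4.0K on a 4 KiB-block filesystem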

We are therefore looking at two magnitudes, file size and blocks used. We tend to think in terms of the former, but we should think in terms of the latter.

Filesystem specific features


In addition to the actual contents of the file, the kernel needs to store all sorts of metadata. We have already seen some of it in the inode, and there is other metadata familiar to any Unix user, such as the mode, ownership (uid and gid), flags, and ACLs.

There are also other structures, such as the superblock that represents the filesystem itself, the vfsmount that represents the mount point, redundancy information, namespaces, and more. Some of this metadata can take up significant space, as we’ll see.

Block allocation metadata

This highly depends on the filesystem we are using, as each one keeps track of which blocks correspond to a file in its own way. The traditional ext2 way of doing this is the i_block table of direct and indirect blocks.

(The classic diagram of direct and indirect blocks can be found on Wikipedia.) This i_block table lives inside the on-disk inode structure.

As files get bigger, this scheme can produce huge overhead, because we have to track thousands of blocks for a single file. There is also a size limitation: using this mechanism, the 32-bit ext3 filesystem can only handle files up to 8 TiB. ext4 developers have kept up with the times by supporting 48-bit block addressing and by introducing extents.

The concept is really simple: allocate contiguous blocks on disk, and just annotate where the extent starts and how big it is. This way we can allocate big groups of blocks to a file using far less metadata, and we also benefit from faster sequential access and better locality.
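From userspace we can list the extents of a file, and check whether it uses the extent format at all; the file name is an example:

    filefrag -v bigfile.iso    # lists each extent: logical offset, physical offset, length
    lsattr bigfile.iso         # the 'e' attribute means the file uses extents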

For the curious: ext4 is backwards compatible, so it supports both methods, indirect blocks and extents. To see how space is allocated, we can look at a write operation. Writes don’t go straight to storage; for performance reasons they first land in the file cache. At some point, the cache writes the information back to persistent storage.

The filesystem cache is represented by struct address_space, and its writepages operation is called on writeback. Along that path, ext4_map_blocks() will call either ext4_ext_map_blocks() or ext4_ind_map_blocks(), depending on whether the file uses extents or not. If we look at the former in extents.c, we’ll notice references to the notion of holes, which we will cover in the next section.

Checksums

The latest generation of filesystems also stores checksums for the data blocks, in order to fight silent data corruption. This gives them the ability to detect and correct these random errors, but of course it also comes with a toll on disk usage, proportional to the file size.

Only the more modern filesystems such as BTRFS and ZFS support data checksums, but some older ones, like ext4, have added metadata checksums.
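On BTRFS, for example, the checksums are verified (and repaired, when redundancy allows it) by a scrub; the mount point below is just an example:

    sudo btrfs scrub start /mnt/data     # walk all data and metadata, verifying checksums
    sudo btrfs scrub status /mnt/data    # check progress and the number of errors found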

The journal

ext3 added journaling capabilities to ext2. The journal is a circular log that records transactions in progress, in order to provide better resilience against power failures. By default it only applies to metadata, but it can be enabled for data as well with the data=journal option, at some performance cost.

The journal is a special hidden file, normally at inode number 8, with a typical size of 128 MiB, as the official documentation explains:

Introduced in ext3, the ext4 filesystem employs a journal to protect the filesystem against corruption in the case of a system crash. A small continuous region of disk (default 128MiB) is reserved inside the filesystem as a place to land “important” data writes on-disk as quickly as possible. Once the important data transaction is fully written to the disk and flushed from the disk write cache, a record of the data being committed is also written to the journal. At some later point in time, the journal code writes the transactions to their final locations on disk (this could involve a lot of seeking or a lot of small read-write-erases) before erasing the commit record. Should the system crash during the second slow write, the journal can be replayed all the way to the latest commit record, guaranteeing the atomicity of whatever gets written through the journal to the disk. The effect of this is to guarantee that the filesystem does not become stuck midway through a metadata update.
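We can peek at the journal of an existing ext4 filesystem with dumpe2fs; the device name below is an assumption, substitute your own partition:

    sudo dumpe2fs -h /dev/sda1 | grep -i journal   # shows the journal inode and related fields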

Tail packing

Also known as block suballocation: filesystems with this feature make use of the tail space at the end of the last block and share it between different files, effectively packing the tails into a single block.

While this is a nice feature that can save a lot of space, especially if we have a large number of small files (as explained above), it makes existing tools report disk usage inaccurately. We cannot just add up all the used blocks of all our files to obtain real disk usage.

Only BTRFS and ReiserFS support this feature.

Sparse files

Most modern filesystems have supported sparse files for a while. Sparse files can have holes in them that are not actually allocated, and therefore don’t occupy any space. This time, the file size will be bigger than the block usage.

(Diagram of a sparse file: see Wikipedia.)

This can be really useful for things like generating “big” files really fast, or for providing free space to a VM’s virtual hard drive on demand. It also means that, for the first time, weird things can happen, such as running out of space on the host while writing to the virtual hard drive inside the virtual machine.

In order to slowly create a 10 GiB file that really uses around 10 GiB of disk space, we can write the zeroes out explicitly.
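For example, with dd (bigfile is a made-up name):

    dd if=/dev/zero of=bigfile bs=1M count=10240   # writes 10 GiB of zeroes, allocating every block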

In order to create the same big file instantly, we can just seek to the end and write a single byte, or not even write anything at all.
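Both tricks can be done with dd as well (again, bigfile is just an example name):

    # write a single byte at the very end; everything before it is a hole
    dd if=/dev/zero of=bigfile bs=1 count=1 seek=$((10*1024*1024*1024 - 1))
    # or write nothing at all and let the seek set the file size
    dd if=/dev/zero of=bigfile bs=1 count=0 seek=10G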

We can also use the truncate command.
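For example (bigfile is still just an example name), and note how the apparent size and the real usage now disagree:

    truncate -s 10G bigfile
    ls -lh bigfile     # reports a size of 10G
    du -h  bigfile     # reports 0, since no blocks are allocated yet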

We can modify the disk space allocated to a file with the fallocate command, which uses the fallocate() system call. With this syscall we can do more advanced things, such as:

  • Preallocate space for the file, filling it with zeroes. This will increase both disk usage and file size.
  • Deallocate space. This will dig a hole in the file, thus making it sparse and reducing disk usage without affecting file size.
  • Collapse space, making the file size and usage smaller.
  • Increase file space, by inserting a hole at the end. This increases file size without affecting disk usage.
  • Zero holes. This will turn the holes into unwritten extents, so that reads will produce zeroes without affecting space or usage.

For instance, we can dig holes in an existing file, making it sparse in place, with fallocate --dig-holes.
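A few of those operations map to flags of the util-linux fallocate tool; the file names, offsets and lengths here are just examples:

    fallocate -l 1G prealloc.img       # preallocate 1 GiB: size and usage both grow
    fallocate -p -o 0 -l 4K bigfile    # punch a 4 KiB hole: usage shrinks, size stays
    fallocate -c -o 0 -l 4K bigfile    # collapse a 4 KiB range: size and usage shrink
    fallocate -z -o 0 -l 4K bigfile    # zero a range as unwritten extents
    fallocate -d bigfile               # dig holes wherever blocks are full of zeroes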

The cp command supports working with sparse files. It tries to detect whether the source file is sparse using some simple heuristics, and then makes the destination file sparse as well. With the --sparse option we can force it either way: copy a non-sparse file into a sparse copy, or conversely make a solid copy of a sparse file.
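For example (file names are made up):

    cp --sparse=always solid_file sparse_copy   # punch holes wherever the source contains zero blocks
    cp --sparse=never  sparse_file solid_copy   # allocate every block, holes included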

If you are convinced that you like working with sparse files, you can add an alias to your shell environment so that copies are sparse by default.
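Something along these lines in your ~/.bashrc would do it:

    alias cp='cp --sparse=always'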

When a process reads bytes from a hole, the filesystem simply provides zeroed pages. For instance, when the file cache reads from an ext4 filesystem in a hole region, the read path in readpage.c notices that the block is not mapped to any extent and fills the page with zeroes, without issuing any I/O to the disk.

After this, the memory segment that the process is trying to access through the  read()  system call will efficiently obtain zeroes straight from fast memory.

COW filesystems

The next generation of filesystems after the ext family brings some very interesting features. Probably the most game-changing feature of filesystems like ZFS or BTRFS is their COW, or copy-on-write, capabilities.

When we perform a copy-on-write operation (also called a clone, a reflink or a shallow copy), we are not really duplicating extents. We just make a metadata annotation in the newly created file: the new file references the same extents as the original file, and the extents are tagged as shared. Userspace is now under the illusion that there are two distinct files that can be modified separately. Whenever a process wants to write to a shared extent, the kernel first creates a copy of the extent and annotates it as belonging exclusively to that file, at least for now. After this, the two files are a bit more different from one another, but they can still share many extents. In other words, in a COW filesystem extents can be shared between files, and the filesystem only creates new extents when it is necessary.

We can see that cloning is a very fast operation, and unlike a regular copy it doesn’t require doubling the space we use. This is really powerful, and it is the technology behind the instant snapshot capabilities of BTRFS and ZFS. You can literally clone (or take a snapshot of) your whole root filesystem in under a second. This is useful, for instance, right before upgrading your packages, in case something breaks.

BTRFS supports two ways of creating shallow copies. The first one applies to subvolumes and uses the btrfs subvolume snapshot command. The second one applies to individual files and uses cp --reflink. An alias can be handy to make fast shallow copies by default, as shown below.
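A rough sketch of both approaches, plus such an alias; all the paths and names here are only examples:

    # snapshot a whole subvolume (read-only in this case)
    sudo btrfs subvolume snapshot -r / /.snapshots/root-before-upgrade
    # shallow-copy a single file
    cp --reflink=always bigfile bigfile.clone
    # make cp attempt a reflink by default, falling back to a normal copy
    alias cp='cp --reflink=auto'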

Going one step further, if we have non-shallow copies of a file, or even different files with duplicated extents, we can deduplicate them so that they reflink the common extents and free up space. One tool that can be used for this is duperemove, but beware that this will naturally lead to higher file fragmentation.
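A possible invocation could look like this (the path is an example; read the duperemove documentation before running it on real data):

    sudo duperemove -dr /mnt/data    # -d actually submits the deduplication requests, -r recurses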

Now things really start getting complicated if we are trying to find out how our disk is being used by our files. Tools such as du or dutree just count used blocks without being aware that some of them might be shared, so they will report more space than is really being used.

Similarly, in BTRFS we should avoid using the df command, as it will report space that has been allocated by the BTRFS filesystem as free; it is better to use btrfs filesystem usage.
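For example (the mount point is an example):

    sudo btrfs filesystem usage /mnt/data   # device allocation, plus data/metadata/system usage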


In order to learn what part of our files is exclusive or shared in BTRFS, we can use btrfs filesystem du. In my case, I wanted to check how much of my Nextcloud logs is shared between snapshots, and it turned out to be most of it. Still, it is hard to tell how the shared extents are distributed.
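The invocation looks roughly like this (the snapshot paths are made up):

    sudo btrfs filesystem du -s /mnt/snapshots/*/nextcloud.log   # prints total, exclusive and "set shared" usage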


At the subvolume level, we can get a rough idea of how much data is exclusive to a snapshot and how much is shared between snapshots with tools such as btrfs-du.

References


https://en.wikipedia.org/wiki/Comparison_of_file_systems

https://lwn.net/Articles/187321/

https://ext4.wiki.kernel.org/index.php/Main_Page

