
Reflections on “Operating System Support for Database Management”

17 Feb 2014

The abstractions provided by operating systems can hinder the development of efficient databases. OS abstractions are essential for building user applications; it would be highly inefficient, in terms of development time, if every application had to implement its own kernel. The OS provides a foundation upon which all applications can build. However, this general framework does not come free: performance and the ability to utilize the full capacity of the hardware are traded for a general platform. Like all things in life, there is a tension between two opposing forces. In 1981, Michael Stonebraker wrote about the tension between operating systems and databases Sto81. These are my thoughts.

Your Mental Model Is a Toy

Databases write to disk. This is the most common method for providing Durability, the D in ACID. It is all too easy to conceptualize the path from database to disk as a single hop. If only software were that simple.

The OS provides main memory buffers and virtual memory paging mechanisms for file system operations. Read and write system calls are serviced by these buffers. When a write call is made the disk may be touched, but that doesn't mean the bytes passed to that call were the ones written to disk. You can make no assumptions about when your data is written to disk. A read call could even cause data to be flushed to disk. This non-intuitive behavior is an effect of the internal buffers and page management of the OS.

Since main memory is finite the buffers must also be finite. This is why a read call may cause data to be flushed to disk. If the buffer is full and doesn’t contain the page needed to satisfy the read then room must be made in the buffer so that the page may be loaded from disk. A common eviction strategy used is LRU.
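To see how a read can turn into a write, here is a toy write-back buffer pool in Python. This is a hypothetical sketch, not the actual OS implementation: when the pool is full, reading an uncached page evicts the least recently used page, and if that page is dirty it must be flushed to disk first.

```python
from collections import OrderedDict

class BufferPool:
    """Toy write-back buffer pool (illustrative sketch, not a real page cache)."""

    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk              # stand-in for the disk: page id -> bytes
        self.pages = OrderedDict()    # page id -> (data, dirty); order = LRU order
        self.flushes = 0

    def _evict_if_full(self):
        if len(self.pages) >= self.capacity:
            victim, (data, dirty) = self.pages.popitem(last=False)  # LRU victim
            if dirty:
                self.disk[victim] = data   # the read just caused a disk write
                self.flushes += 1

    def read(self, pid):
        if pid in self.pages:
            self.pages.move_to_end(pid)    # refresh recency
            return self.pages[pid][0]
        self._evict_if_full()              # make room, possibly flushing
        data = self.disk.get(pid, b"")
        self.pages[pid] = (data, False)
        return data

    def write(self, pid, data):
        if pid not in self.pages:
            self._evict_if_full()
        self.pages[pid] = (data, True)     # dirty until flushed
        self.pages.move_to_end(pid)

disk = {}
pool = BufferPool(capacity=2, disk=disk)
pool.write(1, b"a")
pool.write(2, b"b")
pool.read(3)               # pool is full: page 1 is evicted and flushed
print(pool.flushes, disk)  # → 1 {1: b'a'}
```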

25% of the Time, LRU Works All the Time

Least Recently Used is a ubiquitous cache eviction policy. When room needs to be made in the cache, the object with the oldest access time is evicted. If an object is accessed often it stays in the cache; LRU exploits locality of reference. Intuitively, it is a great eviction policy. But at the end of the day the most important property of a cache is its hit rate. Go below a certain percentage and the cache becomes a liability. The hit rate depends on the context in which the cache is used. LRU favors repeated access of common objects. But what is the access pattern of a database?

Stonebraker gives four primary patterns for the INGRES databases:

  1. sequential access to blocks which will not be rereferenced;
  2. sequential access to blocks which will be cyclically rereferenced;
  3. random access to blocks which will not be referenced again;
  4. random access to blocks for which there is a nonzero probability of rereference.

Only number 4 matches the use case for LRU. In any case where the data will not be accessed again, numbers 1 & 3, there is no benefit to caching; it only adds overhead. Finally, number 2 is the worst case for LRU: the entire data set is iterated and cached, only for the blocks to be evicted before they are re-referenced. Any non-trivial data set is going to be much larger than the buffer, causing data to be evicted long before it is referenced again. In fact, for large loops MRU is the better policy, as older data will be kept Dar96.
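A small simulation makes the point concrete (the `simulate` helper and the trace are my inventions for illustration): a cyclic scan of 10 blocks through a 5-block cache gets zero hits under LRU, while MRU holds on to part of the loop and scores hits on every pass.

```python
def simulate(policy, cache_size, trace):
    """Count cache hits for an eviction policy over a block access trace."""
    cache, order, hits = set(), [], 0   # order: least- to most-recently used
    for block in trace:
        if block in cache:
            hits += 1
            order.remove(block)
            order.append(block)         # refresh recency
            continue
        if len(cache) >= cache_size:
            # LRU evicts the oldest entry; MRU evicts the newest.
            victim = order.pop(0) if policy == "lru" else order.pop()
            cache.remove(victim)
        cache.add(block)
        order.append(block)
    return hits

# Pattern 2: sequential access, cyclically re-referenced.
# 10 blocks looped 5 times through a cache that holds only 5.
trace = list(range(10)) * 5
print(simulate("lru", 5, trace))  # → 0: every block is evicted before reuse
print(simulate("mru", 5, trace))  # MRU keeps older blocks, so some hits
```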

Stonebraker concludes that for the OS buffer management to be more useful it should accept advice from the user application on eviction policy. For example, the man page for posix_fadvise(3C) on my SmartOS machine lists the POSIX_FADV_NOREUSE advice which would be useful in cases 1 & 3 above.
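Python exposes this call as `os.posix_fadvise`. Below is a minimal sketch of advising the kernel before a one-pass scan; note that availability is platform-dependent (hence the `hasattr` guard) and some kernel versions have treated `NOREUSE` as a no-op.

```python
import os
import tempfile

# Create a throwaway file to stand in for a table we will scan once.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"scan me once" * 1024)
    path = f.name

fd = os.open(path, os.O_RDONLY)
try:
    if hasattr(os, "posix_fadvise"):  # not available on every platform
        # len=0 means "the whole file": tell the kernel we will read this
        # data once and it need not be kept in the page cache.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_NOREUSE)
    data = os.read(fd, 4096)          # the scan itself proceeds as normal
finally:
    os.close(fd)
    os.unlink(path)

print(len(data))  # → 4096
```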

Don’t Put the Cart Before the Horse

Controlling the eviction policy is good for performance, but also important for correctness. Most users expect database transactions to be performed Atomically, the A in ACID. The transaction must completely succeed or fail; there is no in-between. One method for achieving this goal is to use a combination of an intentions list and a commit flag. For every transaction three steps are taken to ensure atomicity:

  1. build the intentions list;
  2. set the commit flag;
  3. execute intentions in an idempotent manner.

The intentions list is the set of operations that must be performed in order to fulfill the transaction. The commit flag is an indicator that the intentions list is complete and ready to be executed. The final step must be idempotent because crashes may cause it to be executed repeatedly. According to Stonebraker this is achieved by very careful programming. It seems he has a sense of humor.

It is very important that all intentions flush to disk before the commit flag. If the commit flag flushes to disk while writes to the intentions list are still pending and a crash occurs, then only a subset of the user's transaction will be recovered. This could mean data loss or incomplete updates without any failure indication to the database or user. If the database is to use the OS buffer management then a selected force out should be provided. You could use calls like fsync(3C), but this acts at the file level and may not be granular enough. Even today, over 30 years later, databases often rely on Direct I/O so that the OS buffer management may be bypassed Gre13 §8.3.8.
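The ordering requirement can be sketched in a few lines. This is a hypothetical illustration; the file layout and JSON format are my inventions, not Stonebraker's design. The key moves are to fsync the intentions list before the commit flag exists, and to make replay idempotent so it can safely run again after a crash.

```python
import json
import os
import tempfile

def commit(txn_ops, log_path):
    # Steps 1 & 2: force the intentions list to disk *before* the
    # commit flag is ever created.
    with open(log_path, "w") as log:
        json.dump(txn_ops, log)
        log.flush()
        os.fsync(log.fileno())       # intentions must hit disk first
    with open(log_path + ".commit", "w") as flag:
        os.fsync(flag.fileno())      # only then force out the commit flag

def recover(db, log_path):
    # Step 3: replay is idempotent -- setting a key to the same value
    # twice is harmless, so a crash mid-replay just means replaying again.
    if not os.path.exists(log_path + ".commit"):
        return db                    # flag never made it to disk: discard
    with open(log_path) as log:
        for key, value in json.load(log):
            db[key] = value
    return db

log_path = os.path.join(tempfile.mkdtemp(), "txn.log")
commit([["balance", 100], ["owner", "alice"]], log_path)
db = recover({}, log_path)
db = recover(db, log_path)           # replaying again changes nothing
print(db)                            # → {'balance': 100, 'owner': 'alice'}
```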

One char[] to Rule Them All

The abstraction provided by UNIX & Linux file systems is a sequence of bytes (char[]). Databases, however, provide an abstraction where user-supplied keys map to records, such as primary and secondary key access in an RDBMS. Building these abstractions on top of character arrays can prove inefficient. For example, growing files over time can cause physical fragmentation on the disk. The database logically performs sequential I/O, but it is mapped to random physical I/O. A file system based on extents is more desirable as it keeps blocks physically closer to each other. Sure enough, if you look at the evolution of file systems like UFS, ext3/4, ZFS and btrfs over the last 30 years you'll see features added to address this issue Gre13 §8.4.5. Features such as I/O clustering, extents, preallocation, and COW (copy-on-write) all contribute to more efficient disk layout.

Trees All the Way Down

Anyone who has done even the most superficial study of database internals knows that the B-tree is a fundamental data structure. A primary use of B-trees is to reduce the number of seeks needed to get a piece of data. The file system is also made up of trees. There is a tree to keep track of directories and files as well as a tree to keep track of the blocks that make up a given file. What you end up with is a collection of distinct trees, all giving unique views on the same underlying data. This gives flexibility at the cost of redundancy and overhead. Perhaps there are ways to merge these trees so that one can do the job of many. Performance and efficiency gains might be had if the file system were to provide abstractions like those of the database rather than the char[] abstraction. Or you could flip it on its head by providing the char[] abstraction on top of database abstractions.
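A quick back-of-the-envelope calculation shows why B-trees reduce seeks. The numbers here (100 million records, a fanout of roughly 100 keys per node) are assumed, illustrative figures: a lookup costs roughly one seek per level of the tree, so a wide fanout flattens the tree dramatically compared to a binary tree.

```python
def height(n, fanout):
    """Levels needed for a tree with the given fanout to index n records."""
    h = 0
    while n > 1:
        n = -(-n // fanout)   # ceiling division: records left to distinguish
        h += 1
    return h

records = 100_000_000
print(height(records, 100))   # → 4: a B-tree lookup costs ~4 seeks
print(height(records, 2))     # → 27: a binary tree would cost ~27 seeks
```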

Threads with Threads

Stonebraker spends a significant amount of the article discussing process management and multitasking in databases. At the time of writing, the dominant architecture of UNIX programs was multiple OS processes communicating via pipes. The model works conceptually but at a drastic cost in performance: the overhead of communication and task switching proved too high. This led him to consider the server architecture, which consists of one database OS process shared across all users. However, this model requires implementing multitasking in the database itself, which is complicated and can still be just as costly as the former model. Ultimately, he laments that both models seem unattractive.

Operating systems today provide intra-process threading which allows for relatively cheap task switching as well as efficient communication via shared memory. However, there is still a lot of effort required to implement the multitasking parts of the database. This causes reimplementation of services that may otherwise be provided by the operating system’s scheduler itself. For example, the Riak database is written in the Erlang programming language which implements its own user-level “threads” and scheduler.
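The flavor of such user-level multitasking can be sketched with Python generators. This is a toy round-robin scheduler, not Erlang's actual implementation: each task runs until it voluntarily yields, and the scheduler requeues it, all within a single OS thread.

```python
from collections import deque

def scheduler(tasks):
    """Round-robin over cooperative tasks (generators) until all finish."""
    ready = deque(tasks)
    trace = []
    while ready:
        task = ready.popleft()
        try:
            trace.append(next(task))   # run the task until it yields
            ready.append(task)         # cooperative: put it back in line
        except StopIteration:
            pass                       # task finished; drop it
    return trace

def query(name, steps):
    # A stand-in database subtask: do a unit of work, then yield control.
    for i in range(steps):
        yield f"{name}:{i}"

print(scheduler([query("q1", 2), query("q2", 3)]))
# → ['q1:0', 'q2:0', 'q1:1', 'q2:1', 'q2:2']
```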

There is still work to be done in this area. Stonebraker mused that it might help if the OS offered a special scheduling class for databases. It could provide fast paths for switching between tasks in the scheduler. This would allow subtasks in a database to yield to each other in a more efficient manner. One might look at Paul Turner’s talk, User-level threads...with threads Tur13, where he discusses this very idea.

Main Memory Is the New Disk

A lot of database technology in the last 3 decades has been focused on providing efficient operation with limited main memory and slow persistent storage. However, the landscape is changing thanks to the decline in memory cost and the invention of SSDs. In section 6, Stonebraker critiques the use of virtual memory. He claims that while it can provide performance gains by avoiding system calls, it also introduces more memory pressure to keep track of the page table. The larger the file, the more pages must be tracked.

Given the massive increase in the size of main memory over the years, one might believe this is less of an issue today. One would be wrong. Main memory speed has not kept pace with its increase in density, nor with the increase in speed of CPUs. Main memory has become the new disk. Around minute 9 of Cliff Click's fascinating talk, A Crash Course in Modern Hardware, he refers to this problem as the memory wall Cli09. He says that on modern x86 your program's performance is going to be blips of execution between cache misses. These cache misses cost thousands of CPU cycles. What this means is that page table size still matters. For maximum efficiency it should mostly fit in the CPU's L1/L2 cache.
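A rough calculation shows how quickly the page table outgrows CPU caches. The figures assumed here (4 KB pages, 8-byte entries) are typical x86-64 values, and the cache sizes are ballpark numbers, not measurements.

```python
def table_bytes(file_bytes, page=4096, entry=8):
    """Bytes of leaf page-table entries needed to map a file of this size."""
    return -(-file_bytes // page) * entry   # ceil(file/page) * entry size

# Mapping a 10 GiB file needs ~20 MiB of leaf entries alone -- far larger
# than a typical 32 KiB L1 or ~1 MiB L2, so page-table walks will miss.
print(table_bytes(10 * 2**30) // 2**20)     # → 20 (MiB)
```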

The Database Operating System

While there are other points covered in Stonebraker's article, the main idea is about building an operating system that can more efficiently support a database. It is about providing primitives that can move logic out of the database and into the kernel for the purposes of better utilizing the hardware and decreasing the redundancy of code.

I myself have thought about such things while working on the Riak database. For example, I have often wished filters could be sent all the way down to the disk device. This would allow loading only relevant data into main memory and thus make queries more efficient. This is not a new idea; it was tried in the 70s, like all good things were. Another idea is to compile every application into its own standalone, specialized microkernel, as is done in MirageOS Mad14.

It is not trivial to extend the operating system. Extreme vigilance must be paid so that the system API does not become too brittle. You want to be general enough to apply to a large set of applications, but at the same time specific enough to reach into the hardware's guts when needed. I look forward to seeing where we go in the next 30 years.