Debugging a Zig Test Failure
Sep 25, 2023
As a computer programmer what do you do when faced with an unfamiliar error? Do you head for Google in hopes that some kind soul before you has made a sacrifice at the Altar of Shannon? Do you begin on a hero’s journey of printf statements? Perhaps you’re an intellectual who debugs from first principles by talking to yourself in the shower. No, you’re too sophisticated for that, your coffees are pour overs and your debugging is done in a debugger — you step through each line as meticulously as you weighed your 10g of coffee per 180ml of water this morning. Or finally, maybe you just “don’t have time for this shit” so you do a drive by on some poor schmuck’s issue tracker with the most vague report possible never to return again, officially making this “someone else’s problem”. Look, I get it, we’ve all been there. Sometimes you are 11 bugs deep and you just need this one win, and you need it right now.
But what if I told you there is another way? What if the art of debugging isn’t one of those things but a combination of all of them? What if I told you the key to good debugging is a rabid curiosity combined with the ability to ask the system questions about itself? Let me give you a glimpse of this world in the context of a recent error I debugged while standing up zig-0.11.0 on illumos.
Here we have one of Zig’s standard library tests reporting a failure. From the output alone we can surmise that the failure is with the mkdiratZ() function and that the test has something to do with checking “max file name component lengths”. The error NameTooLong points to something being larger than expected.
From here there are many places we could choose to visit next. Where to go next is the art of debugging and depends largely on your existing knowledge of “the system”. By “the system” I mean many things all at once: Zig, this test, the standard library, the operating system, the hardware it’s running on, etc. Debugging starts with your current body of knowledge and ends by answering the question “what is happening”. Between those two points you may have to answer many other questions. Many times those questions will take you outside the bounds of your knowledge. The trick is to gently expand your horizon until it sheds light on the ultimate question you want answered. You do this with a rabid curiosity along with the ability to ask questions of the system.
When I first saw this test failure my existing body of knowledge clued me in to some basic facts about this error.
- The mkdiratZ() function is probably a wrapper around the mkdirat system call. I know this because I’ve spent years working on the illumos kernel and have built up some knowledge on operating systems.
- The NameTooLong failure is probably Zig’s symbol for the operating system’s ENAMETOOLONG error code.
Given these two facts, my hunch is that the Zig stdlib disagrees with the operating system on some maximum value related to path names.
Oftentimes the first question I like to answer is “what sequence of steps led to the failure?” Just like we refer to software as “high level” and “low level”, we can answer a question like this at ever deepening levels of detail. For example, the NameTooLong failure happens when I run Zig’s stdlib test suite. Going another level down I know it happens when I run the max file name component lengths test. Another level down I know it happens when calling the mkdiratZ() function. Repeating this process eventually brings you to some bedrock upon which your overall understanding can rest. First you focus your eyes on the trees, then you unfocus them to see the forest.
In this case Zig was nice enough to provide a stacktrace of the events leading up to the failure. At the top of the stack is the mkdiratZ() call. After that we cross the system call boundary where things are handed off to the operating system; but I’m not worried about that just yet. I want to gather more data on what Zig is doing first. From the bottom of the stack we see where testFilenameLimits() is called, along with where that function is defined in the source. I’d like to go another level deeper and look at the test source to see what I can infer about this test.
From the documentation I know that the first line is using the array multiplication operator (**) to build an array of repeating "1" bytes of length std.fs.MAX_NAME_BYTES. This array is the input to our failing test testFilenameLimits(). It’s not a terrible bet that maxed_ascii_filename is the same path that is passed to mkdirat. I’ll make that assumption for now while keeping in mind that I could be wrong. Next I want to find the code that defines std.fs.MAX_NAME_BYTES.
The code for std.fs.MAX_NAME_BYTES provides more information:
- This upper bound has to do with “file name components”.
- Other systems use the constant NAME_MAX, but illumos (currently tagged under .solaris) uses this oddly named MAXNAMLEN. Yes, that says NAM, not NAME. Anytime you find a missing vowel it’s a good sign you might be dealing with some 40-year-old Unix anachronism.
At this point, based on my hunch that Zig is calling into the mkdirat system call, I feel compelled to read its man page.
Man pages document the contract made between you and some other part of the system. They are paramount to understanding your operating system. Here is an abbreviated version of the mkdirat(2) page.
The man page describes the ENAMETOOLONG error we see in the Zig stacktrace. There’s also mention of the NAME_MAX constant used by other systems in the Zig code. My hunch about a mismatch in max length is looking stronger and now is a good time to verify it by asking the operating system what it sees when it runs the test.
Here I’ve used truss to trace mkdirat calls made by the Zig test process or any of its spawned children. And sure enough we see a call with a path consisting of a long sequence of "1" bytes which ultimately fails with ENAMETOOLONG (see errno.h). Based on what the mkdirat(2) man page says we should see ENAMETOOLONG when the path component exceeds NAME_MAX, but the Zig code for illumos is using some value named os.system.MAXNAMLEN. It would seem that Zig’s value is greater than the operating system’s, leading me to two new questions.
- Where does os.system.MAXNAMLEN come from and what is its value?
- Where does NAME_MAX come from and what is its value?
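The failure mode itself is easy to reproduce outside of Zig. The following is a minimal Python sketch, not the Zig test: it assumes a Unix-like system whose filesystem rejects a 511-byte name component (true of common filesystems), and the helper name try_mkdirat is made up for illustration. Passing dir_fd to os.mkdir() creates the directory relative to an open directory descriptor, which is the mkdirat-style interface.

```python
import errno
import os
import tempfile


def try_mkdirat(name_len: int) -> int:
    """Create a directory named with `name_len` "1" bytes; return 0 or the errno."""
    with tempfile.TemporaryDirectory() as parent:
        dir_fd = os.open(parent, os.O_RDONLY)
        try:
            # With dir_fd set, the path is resolved relative to that directory.
            os.mkdir(b"1" * name_len, dir_fd=dir_fd)
            return 0
        except OSError as e:
            return e.errno
        finally:
            os.close(dir_fd)


result = try_mkdirat(511)
# Print the symbolic errno name when the call failed.
print(errno.errorcode.get(result, result))
```

A short name goes through the same helper without error, so the only variable is the component length.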
There are several ways to discover the value of os.system.MAXNAMLEN. The easiest is to search for the symbol in the source. This is also a good excuse to get more familiar with Zig by manually tracking through the code a bit.
Finding the definition of os.system is easy, but you might be confused about how exactly it resolves given the code above. Let’s take a step back.
Zig is a programming language. Programming languages ship with a standard library that allows them to perform useful routines, some of which require access to the underlying “system”. In most cases that system is the operating system, but it could just as well be a “freestanding” target. A freestanding target is how one targets an embedded environment or writes their own operating system. But the most typical situation is one where the system is the operating system you are running on.
Not all operating systems are created equal, and how a userland program interacts with the operating system varies. For Linux, the API/ABI is the system call table. Linux does not ship a libc or any other system library or utility for that matter. This is the meaning behind the meme “I think you mean GNU/Linux”. Other operating systems make different choices. For example, FreeBSD gives you everything: kernel, libc, system libraries and tools, and a POSIX environment; it stands on its own. Illumos sits in the middle: it provides the kernel, libc, system libraries, system utilities, and most of the POSIX environment, but illumos itself is not an installable operating system. Rather, like Linux, it has distros which fill in the blanks. I’m running a distro called OmniOS which is geared towards server installs. Unlike Linux, the illumos API/ABI is libc, NOT the system call table. Not only that, but there is no static libc in illumos; you must link to it dynamically. In Linux you can choose from multiple libc providers, or you can go direct to the system call table; it’s up to you.
Going back to the Zig comment above, it should be more clear what they mean by “system”. Basically, Zig gives you the opportunity to completely redefine the system, link to libc, or utilize its built-in implementation for the system. By the way, if you look at builtin.zig you’ll see there is no link_libc field defined; that’s because it’s generated on the fly as part of compilation. It’s set to true if the underlying system requires linking libc (like illumos) or if the programmer specifically demands it in the build.zig file.
generated values by running
On illumos the system symbol resolves to std.c.solaris. In there we find MAXNAMLEN defined with a value of 511. The odd number feels a bit weird as I would expect 512. Is this an off-by-one error? That can’t be because 511 is less than 512; so we would expect to never reach the limit. Rather than guess I should look at the definition for this constant. Normally I would cscope a checkout of the illumos source, but I can also just grep the header files installed on my system.
So Zig thinks the value is 511 and the system says 512. This makes sense because in Zig you prefer to pass a slice ([]const u8) which includes the length and has no need for NUL termination. Zig wants to keep 1 byte in reserve to properly NUL-terminate the string before passing it to the system. What about the value of the NAME_MAX constant that the man page mentioned?
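Returning briefly to the one-byte difference between 511 and 512: it is the usual gap between a length-carrying byte string and a NUL-terminated C string. A small Python illustration of that arithmetic follows, using ctypes as a stand-in for a C buffer; this is not how Zig does it internally, only a sketch of why a 511-byte name needs a 512-byte buffer.

```python
import ctypes

MAXNAMLEN = 511  # the value Zig uses for illumos, per the text above

# A length-carrying byte string, analogous to a slice: no terminator needed.
name = b"1" * MAXNAMLEN

# create_string_buffer() copies the bytes and appends a NUL terminator,
# so the C-side buffer is one byte larger than the name itself.
c_buf = ctypes.create_string_buffer(name)

print(len(name), ctypes.sizeof(c_buf))
```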
The value Zig is using is greater than the value enforced by the system. Now is the time to find the smoking gun, to track down the precise location where the system enforces this limit and generates the error. But how do we do that when mkdirat is a system call? Tools like truss cannot help us here as they only trace things from the userland perspective; we need a view into the kernel. We could cscope the kernel code, but in cases like this it can often be ambiguous whether we are looking at the correct code location or not. What we’d really like is the ability to place a dynamic printf in each kernel function indicating when it enters and returns, along with its return value. We could create a custom kernel fit for this purpose but it would be a massive undertaking and not a good use of our time. Ideally there should exist some system tool to perform such a feat on the fly without any additional installation or compilation.
Magical Dynamic printf
Pretend with me for a moment, that we have a magical scripting language that lets us trace functions on-the-fly. This magical language is going to look something like AWK, where we have a sequence of patterns with optional predicates, and a block of associated actions.
Let’s start with a script that prints all kernel function entries and returns along with the return value.
The probe patterns consist of a colon-separated sequence that starts with the provider. In this case the provider is fbt, which stands for Function Boundary Tracing. It allows us to trace all function entry/return points in the kernel. The probefunc variable is the name of the function, and the probename variable indicates entry or return. So fbt:*:entry traces all kernel function entry probes; and from that you should be able to guess what fbt:*:return does. We grab the return value from the variable retval. These variables are “built-in”, meaning they are available implicitly and their value is determined by the context they are referenced in.
This is a good start, but we have a major problem: this traces all kernel functions all of the time. I want to add two additional features to limit probe firing only to when they are in service to a mkdirat system call.
The syscall provider allows us to trace entry/return points for a given system call. We could use the fbt provider for the same purpose, but due to the 50 years of Unix history in illumos the kernel function names don’t always match the system call names as they are presented to users.
The self->trace declaration is a thread-local variable; it retains its value across probe firings as long as they happen in the context of the same thread. Said another way, the variable exists only in the thread that set it, giving us a filtering mechanism to trace only the calls made in service to this thread. We set it to 1 to use as a truthy value in a probe’s predicate. We set it to 0 as a falsey value to disable the tracing.
With these changes we now print only function calls made in service to a mkdirat system call. This greatly reduces the output, but it could still be noisy as it traces mkdirat calls made by all processes on the system. It would help if we could filter it further to only calls from a specific executable, but even then Zig might make quite a few mkdirat calls, especially when running a test suite. What we really want is the ability to trace the sequence of kernel function calls only when a mkdirat call results in a failure.
One way to do this is to use the script above and then post-process the output. That requires bringing in another tool and an additional step to get the information I want. Perhaps the magic scripting language could support this use case directly? What if the printf output could be written to a “speculative” buffer? And what if we could delay the decision to print the contents of that speculative buffer until the return of the mkdirat call and base that decision on the return value?
Upon entry to a mkdirat call we create a new speculation buffer local to the running thread. As fbt probes fire we check for an active speculation and perform a printf to its buffer. Finally, on return from mkdirat, we decide to commit or discard the buffer based on the return value.
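The buffer-then-decide idea can be modeled in a few lines of ordinary code. The sketch below is a toy Python model, not the tracing language itself: fake_mkdirat, check_component, and HYPOTHETICAL_LIMIT are all invented for illustration, and the "kernel function" is just a Python function that records its own entry and return.

```python
from typing import List

ENAMETOOLONG = 78        # errno value the article reports for ENAMETOOLONG on illumos
HYPOTHETICAL_LIMIT = 8   # made-up component limit for this toy model


def check_component(name: str, spec: List[str]) -> int:
    """Stand-in for a kernel function; records its entry and return."""
    spec.append("-> check_component")
    ret = ENAMETOOLONG if len(name) >= HYPOTHETICAL_LIMIT else 0
    spec.append(f"<- check_component = {ret}")
    return ret


def fake_mkdirat(name: str, committed: List[str]) -> int:
    spec: List[str] = []               # speculation created on "syscall" entry
    ret = check_component(name, spec)  # trace lines land in the speculation
    if ret != 0:
        committed.extend(spec)         # failure: commit the buffered trace
    # success: the speculation is simply dropped (discarded)
    return ret


output: List[str] = []
fake_mkdirat("ok", output)        # succeeds, so its trace is discarded
fake_mkdirat("1" * 20, output)    # fails, so its trace is committed
print("\n".join(output))
```

Only the failing call leaves anything behind, which is exactly the filtering we want without a post-processing step.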
At this point you might think I’m a bit crazy to propose such an advanced tool — that’s some great vaporware you got there Ryan. But it’s not a proposal, merely a watered down description of a language that has existed for 20 years.
What I’ve just described is a small subset of what is possible in DTrace. DTrace is an AWK-like scripting language that allows you to ask questions about almost any aspect of the system in a dynamic and production-safe manner. It’s available on all illumos-based systems as well as Solaris, macOS, FreeBSD, Windows, and other systems. On illumos it doesn’t require any special kernel modules or compilation — it’s there waiting and ready for the moment you need it. The differences between my make-believe script and the real one are mostly in syntax.
- The pragmas at the top set DTrace options from within the script itself instead of having to pass them on the command line each time. The flowindent option makes the output easier to read by automatically indenting output based on function entry/return. I also increase the size of the speculative tracing buffer to make sure there is enough room to print all the function calls. Likewise the maximum string size is increased to account for Zig’s 512-byte path.
- The syscall probe fires in user context but DTrace actions run in kernel context. The copyinstr() routine copies a C string from user to kernel space.
- DTrace provides built-in variables. I use errno to determine if the mkdirat call failed.
- Instead of a retval variable, you get the return value of an FBT return probe via one of the probe’s arguments.
The script produced the following output.
The trick here is to start at the bottom and trace the origin of the 78 (ENAMETOOLONG) value. We end up at the pn_getcomponent() function which comes from the "Path Name utilities" code found in pathname.c.
And now we have a third constant: MAXNAMELEN. Notice that this one has a vowel where the one defined in Zig does not. It’s defined in sys/param.h which is where various system parameters are kept.
And with that I have my smoking gun. The illumos kernel sets a limit of 256 bytes, including the terminating NUL, for a component name. Zig, on the other hand, believes the limit is 512. So where did MAXNAMLEN (notice the missing vowel again) come from and does it ever come into play?
The #if pre-processor macro that comes before the definition is important. It states that this constant is exposed only when asking for extensions or when we are not compiling for an XOPEN/POSIX environment. This has to do with the enforcement of standards and you can read more about this in the corresponding man page. The section on “Feature Test Macros” speaks a bit to the __EXTENSIONS__ check. But the main takeaway is that extensions define features outside of a conforming environment.
MAXNAMLEN, MAXNAMELEN, and NAME_MAX
So what’s the correct constant to use? Technically speaking, none of them. The maximum component length is up to the filesystem and should be queried at runtime with pathconf.
This system call and its related
NAME_MAX were introduced way back in
POSIX-1988. However, as I learned
there is a history of using
MAXNAMLEN in the
I did some spelunking in
and found that BSD 4.3 Reno defined
dir(5) page referred
MAXNAMLEN and its VFS lookup function used
that constant to enforce component length. BSD 4.4
pathconf and modified the VFS layer
MAXNAMLEN. In fact, to this day
MAXNAMLEN. So how did we end up
with the wrong value? Fifty years of Unix history and a lot of
confusion, that’s how.
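Asking the filesystem for its limit at runtime, rather than trusting any compile-time constant, can be sketched as follows. This is a minimal Python example assuming a Unix-like system where os.pathconf() supports the "PC_NAME_MAX" name; the helper max_component_len is made up for illustration and is not part of Zig or illumos.

```python
import errno
import os
import tempfile


def max_component_len(directory: str) -> int:
    """Ask the filesystem backing `directory` for its maximum name length."""
    return os.pathconf(directory, "PC_NAME_MAX")


with tempfile.TemporaryDirectory() as d:
    limit = max_component_len(d)
    try:
        # One byte over the reported limit should be rejected.
        os.mkdir(os.path.join(d, "1" * (limit + 1)))
        over_limit_errno = 0
    except OSError as e:
        over_limit_errno = e.errno

print(limit, errno.errorcode.get(over_limit_errno, over_limit_errno))
```

The reported value depends on the filesystem behind the directory, which is the whole point: the same program can see different limits on different mounts.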
The fix here is to change the Zig code to use NAME_MAX for illumos just like the other platforms. After that the test passes as expected.
Request for Questions
Did you find this interesting? Do you have a question about your system that you don’t know how to answer? If so, send me an email with your question and the tag RFQ in the subject line. If I have some insight to offer I will. If I think the question is fun I might even write a blog post on it.