How My Printer Caused Excessive Syscalls & UDP Traffic
5 Nov 2014
3,000 syscalls a second, on an idle machine? That doesn’t seem right. I just booted this machine. The only processes running are those required to boot the SmartOS Global Zone, which is minimal.
Well, at least I have DTrace. I can use this one-liner to figure out what syscalls are being made across the entire system.
Wow! That is a lot of lwp_sigmask
calls.
Now that I know what is being called, it’s time to
find out who is doing the calling? I’ll use another
one-liner to show me the most common user stacks
invoking lwp_sigmask
.
Okay, so this mdnsd code is causing all the trouble. What is the distribution of syscalls for the mdnsd program?
Lots of signal masking and polling. What the hell! Why is it doing this? What is mdnsd anyways? Is there a man page? I switch tabs to my OmniOS machine because I don’t have the whatis database on my SmartOS Global Zone.
I open the dns-sd(1M)
manual. It is a tool for
probing Multicast DNS (mDNS) and DNS Service Discovery.
Googling for mdns reveals that it is used for resolving
host names in small networks, like my home network. It uses
UDP, and requires zero configuration. Nothing obvious to
explain why it’s flipping out. I feel helpless. I turn to
the only thing I can trust, the code.
Woah boy, this is some messy looking code. This would not pass
illumos cstyle checks. Turns out this is code from Darwin—the
kernel of OSX. That explains the references to Safari and
Bonjour in the dns-sd(1M)
page. I move on.
From the previous stack trace, I know
that mDNSPosixRunEventLoopOnce
is the hot
function. It consists of a tight loop, running until
a SIGINT
or SIGTERM
is sent. There
is a timeout
variable passed to this function. I
bet that has something to do with my problem. What is the
distribution of values for this variable?
What the fuck is 4290771112951812
all
about? Is that a bug in the code? I dunno, but it’s not my
concern right now. I’m having the opposite problem, no
sleeping at all. The output indicates a fairly even split
between a timeout of 0 and 5 seconds. I go back and study the
code for a minute, coming away with two observations.
-
The
timeout
is set 0 whengotData
is true. -
Even when
timeout
is set to 5 seconds the call returns immediately. It couldn’t run over 400 times a second otherwise.
I’m going to have to look at the code
for mDNSPosixRunEventLoopOnce
to figure this one
out.
In this function gotData
is renamed
to pDataDispatched
. If select(3C)
finds any ready descriptors this flag will be set to true. At
the end of the function are the lwp_sigmask
(sigprocmask
maps to lwp_sigmask
)
calls that initially caused my investigation.
So MainLoop
calls mDNSPosixRunEventLoopOnce
in a tight loop,
which is essentially just a select call. This explains the
large timeout value, which causes it to act like a blocking
read. When passed a timeout of 0, it acts as a poll. The
only thing to keep this loop from eating the CPU is passing a
non-zero timeout when no data is ready to be read. I already
know that is not happening by virtue of the fact that this
function is executed hundreds of times a second.
Hmmm…I wonder…I wonder if my MacBook is also experiencing abnormal syscall rates? Nooo, that can’t be it. Why would both my SmartOS server and MacBook both have the same problem? There is no good technical reason to link these two. But, then again, I’m dealing with computers here, and I’ve seen a lot of strange things over the years. I switch to my laptop.
What the hell? I ignore Google Chrome Helper as I’m pretty
sure it’s not related. But next in
line, discoveryd
, looks very suspicious. It
doesn’t have mdns in the name, but a) it is making a lot of
syscalls, b) the word discovery has a lot to do with DNS and
c) it ends in ‘d’, indicating it is probably a daemon. Sure
enough, the first google result for “discoveryd” is a thread
stating that mDNSResponder
was replaced
with discoveryd
in OSX Yosemite.
Perhaps Yosemite is causing all my pain? I ask my friend Steve Vinoski to run the same DTrace one-liner on his OSX machines. He has both Yosemite and the older Mountain Lion. But, to my dismay, neither of his machines are exhibiting high syscall rates. My search continues.
Not sure what to do next, I open the OSX Activity Monitor. In desperation I click on the Network tab.
HOLE—E—SHIT! Two-Hundred-and-Seventy Million packets
received by discoveryd
. Obviously I can stop
looking at code and start looking at my network. I hop back
onto my SmartOS machine and check network interface
statistics.
In only a few hours, my idle SmartOS server has received over three million packets. How many packets are coming in every second?
Whatever is causing all this, it is sending about 200 packets
a second. At this point, the only thing left to do is
actually inspect some of these incoming packets. I
run snoop(1M)
to collect events on
the e1000g0
interface, stopping at about 600
events. Then I view the first 15.
A constant stream of mDNS packets arriving from
IP 10.0.1.8
. I know that this IP is not any of
my computers. The only devices left are my iPhone, AppleTV,
and Canon printer.
Wait a minute! The printer! Two days earlier I heard some beeping noises…
I own a Canon PIXMA MG6120 printer. It has a touch interface with a small LCD at the top, used to set various options. Since it sits next to my desk I sometimes lay things on top of it like a book or maybe a plate after I’m done eating. If I lay things in the wrong place it will activate the touch interface and cause repeated pressing. Each press makes a beeping noise. If the object lays there long enough the printer locks up and I have to reboot it. Just such events occurred two days earlier.
I fire up dladm
again to monitor incoming packets
in realtime. Then I turn to the printer. I move all the crap
off of it: two books, an empty plate, and the title for my
Suzuki SV650 that I’ve been meaning to sell for the last year.
I try to use the touch screen on top of the printer. It’s
locked up, as expected. I cut power to the printer and whip
my head back to my terminal.
I run DTrace again to count syscalls.
I’m not sure whether to laugh or cry. I laugh, because, LOL computers. There’s some new dumb shit you deal with everyday, better to roll with the punches and laugh. You live longer that way. At least I got to flex my DTrace muscles a bit. In fact, I felt a bit like Brendan Gregg when he was debugging why OSX was dropping keystrokes.
I didn’t bother to root cause why my printer turned into a UDP machine gun. I don’t intend to either. I have better things to do, and if rebooting solves the problem then I’m happy. Besides, I had to get back to what I was trying to do six hours before I started debugging this damn thing.