Recently at work I spent a few weeks tuning a network service across
three platforms (Solaris, Linux, and AIX) to get within 10% of the
theoretical maximum throughput. In this short article, I’ll walk through
the various tools I used to improve the performance of the application.
This application is very specialized in that the two machines are
connected directly through an ethernet switch. This means that the MTU
could easily be determined from each end of the link and the extra work
to determine the maximum segment size for the transit network (see RFC
1191) was unnecessary. This also
made it very easy to watch the traffic between the two hosts as well as
the system calls they were using to transfer and receive the data.
Before I get into the steps I took to tune the service, I’d like to
introduce the tools used:
- Truss: a tracing utility which displays system calls, dynamically
loaded user level function calls, received signals, and incurred
machine faults. This is available for many platforms, but I use it
most on AIX.
- DTrace/DTruss: a dynamic tracing compiler and tracing utility. This
is an amazingly powerful tool from Sun, originally for Solaris but
slowly spreading to other platforms. See Sun’s How To
Guide.
- strace: a dynamic tracing utility which displays systems calls and
received signals under Linux.
- mpstat: collects and displays performance statistics for all logical
CPUs in the system.
- prstat: iteratively examines all active processes on the system and
reports statistics based on the selected output mode and sort order.
- tcpdump: a utility for capturing network traffic.
- Wireshark: a network protocol analyzer. It replaces the venerable
Ethereal tool and allows you to either capture network traffic on
demand or load a captured session for analysis. Find out more
here.
- gprof: a tool for profiling your code to determine where the
performance bottle-necks are. See the
manual
for more information.
- c++filt: a tool for demangling C++ method names. It is part of the
GNU binutils package.
Since I already had the service up and running, I simply ran the two
components and captured the traffic between them using tcpdump. While
the processes were running, I also used dtruss, truss, or strace
(depending on the platform) to capture the system calls being made.
Since this is a network service, I focused on calls to select
, send
,
and recv
.
13455/15: 2143177 2994 4 pollsys(0xFFFFFD7EBADDB910, 0x1, 0xFFFFFD7EBADDBA30) = 1 0
13455/15: 2143180 5 0 pollsys(0xFFFFFD7EBADDB8D0, 0x1, 0xFFFFFD7EBADDB9F0) = 1 0
13455/15: 2143185 8 4 recvfrom(0x11, 0xB384A0, 0x10000) = 1416 0
13455/15: 2143253 5 0 pollsys(0xFFFFFD7EBADDB8D0, 0x1, 0xFFFFFD7EBADDB9F0) = 0 0
13455/15: 2143262 12 8 send(0x11, 0xB084D0, 0x14B8) = 5304 0
13455/15: 2143268 365 4 pollsys(0xFFFFFD7EBADDB910, 0x1, 0xFFFFFD7EBADDBA30) = 1 0
13455/15: 2143270 4 0 pollsys(0xFFFFFD7EBADDB8D0, 0x1, 0xFFFFFD7EBADDB9F0) = 1 0
13455/15: 2143275 8 4 recvfrom(0x11, 0xB384A0, 0x10000) = 1416 0
13455/15: 2143343 5 0 pollsys(0xFFFFFD7EBADDB8D0, 0x1, 0xFFFFFD7EBADDB9F0) = 0 0
13455/15: 2143348 9 4 send(0x11, 0xB084D0, 0x14B8) = 5304 0
13455/15: 2143353 1000 4 pollsys(0xFFFFFD7EBADDB910, 0x1, 0xFFFFFD7EBADDBA30) = 1 0
Looking at the results above you can see that select
(pollsys
) is
being called each time we need to send or receive data over the network.
Since the socket is non-blocking we can rely on the immediate return
when the outgoing socket buffer is full as well as when there is no data
available to read. By select
ing at the very top of the receive loop we
can bundle multiple receive calls together, increasing the application’s
throughput. Now the output looks like this:
16712/9: 16202 1560 6 pollsys(0xFFFFFD7EBB9DB940, 0x1, 0xFFFFFD7EBB9DBA30) = 1 0
16712/9: 16217 10 6 recv(0xB, 0x8A6450, 0x10000) = 1416 0
16712/9: 16246 9 5 send(0xB, 0x876480, 0x540) = 1344 0
16712/9: 16267 7 3 send(0xB, 0x876480, 0x540) = 1344 0
16712/9: 16285 5 1 send(0xB, 0x876480, 0x540) = 1344 0
16712/9: 16680 10 5 recv(0xB, 0x8A6450, 0x10000) = 1416 0
16712/9: 16712 11 7 send(0xB, 0x876480, 0x540) = 1344 0
16712/9: 16733 7 3 send(0xB, 0x876480, 0x540) = 1344 0
16712/9: 16753 6 2 send(0xB, 0x876480, 0x540) = 1344 0
16712/9: 16768 4 0 recv(0xB, 0x8A6450, 0x10000) = -1 Err#11
You’ll notice that now we are able to process two requests and send out
six responses in the time that it previously took to call select and
receive a single request. When there is nothing left to read, the call
to recv
returns errno 11 (EAGAIN
). This change made the single
biggest performance impact on the code. I also changed the calls
recvfrom
to recv
since the application did not make use of the
foreign address.
At this point the performance was much better but I noticed that under
heavy load the sending socket would block as the ratio of requests to
responses was 1:3. As this was a UDP application, having the sending
buffers fill up seemed strange as we assumed that additional packets
would simply be dropped on the floor.
On the server, I checked the UDP socket buffer size using ndd
(this
was under Solaris. For AIX the command is no
and for Linux the command
is sysctl
).
The following code was added to the socket initialize (minus the error
handling) to ensure that the socket buffers were large enough.
unsigned size = 1024 * 1024; // 1MB
int ret = setsockopt(desc, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size));
ret = setsockopt(desc, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size));
Now that the application was performing acceptably I decided to run it
under the profiler. This turned up the function which was adding
responses to the in-memory packet. It turned out that as responses were
being added to the packet, the headers were being recalculated each
time. I removed this unnecessary work and only made the calculations
right before the packet was sent. This improved performance a few
percentage points more.
By binding the network interrupts to a particular core and keeping the
sending thread off of that core we were able to eek out additional
performance from the application. To accomplish this, the application
allows the operator to specify which core(s) it should bind to using
sched_setaffinity
(Linux) and processor_bind
(Solaris). You can also
accomplish this using taskset
(Linux) and pbind (Solaris) if you don’t
wish to modify your application.
Looking at the network traffic with tcpdump, I saw that I could fit an
additional response in the response bundle packet if I reduced or
removed some of the items in the packet header. At this point the
analysis and tuning had gone on for a few weeks and we had a schedule to
meet. Since the performance was where we needed it, the application was
wrapped up and sent to quality assurance.
The single most important lesson I learned from this exercise was to use
non-blocking sockets to their fullest by continually calling
recv
/send
until the call would block and then using select
to idle
the process until there is work to do.