A few notes from an NFS debugging session

We were seeing intermittent failures with NFS, particularly when a user would send a large file to the NFS client machine. From that point on, all access to the directory caused the accessing process to hang.

Analysis with tethereal showed that the NFS client was sending many retransmits to the NFS server, which never responded. As it happens, NFS uses UDP by default. Watching both ends of the connection, it became clear that packets were being dropped somewhere in between.

The solution to this was to mount NFS with TCP, rather than UDP, since we have no control over the intervening network and its (probably numerous) firewalls. To do this you need to make sure TCP/NFS is configured into your kernel, and then just specify the tcp option to mount.

If you're using automount for home directories and such, you might modify your auto.home file to look something like

--- auto.home   2005-07-15 16:46:57.000000000 +1000
+++ auto.home.new       2005-07-15 16:46:45.000000000 +1000
@@ -1 +1 @@
-*      eisbock.ken.nicta.com.au:/home/&
+*      -tcp    eisbock.ken.nicta.com.au:/home/&

Another option would have been to tunnel the NFS connection over SSH, or maybe a VPN.

In summary, if you're asked to debug an unreliable NFS server, checking for UDP packet loss or switching over to TCP is a good place to start.
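As a quick way to see which transport each NFS mount is actually using, here is a minimal Python sketch that reads text in the /proc/mounts format. It assumes the transport shows up in the option list either as a bare tcp/udp flag or as proto=tcp/proto=udp, which may vary between kernels. The function name and the sample mount lines are made up for illustration.

```python
def nfs_transports(mounts_text):
    """Map each NFS mount point to 'tcp', 'udp' or 'unknown'."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        transport = "unknown"
        for opt in fields[3].split(","):
            if opt in ("tcp", "udp"):
                transport = opt
            elif opt.startswith("proto="):
                transport = opt.split("=", 1)[1]
        result[fields[1]] = transport
    return result


# Made-up sample input; on a real machine you would read /proc/mounts instead.
SAMPLE = """\
eisbock.ken.nicta.com.au:/home/ianw /home/ianw nfs rw,vers=3,proto=udp,addr=192.0.2.1 0 0
server:/export /mnt/data nfs rw,tcp 0 0
/dev/sda1 / ext3 rw 0 0
"""

for mount_point, transport in nfs_transports(SAMPLE).items():
    print(mount_point, transport)
```

Anything still reported as udp across a network you don't control is a candidate for the -tcp treatment.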

Control Key Intercepted when using Gimp

Today I was trying to use the 'clone' tool in Gimp and couldn't for the life of me get it to accept the source position with the normal 'control-click'. It seemed like it was being intercepted by something, as it turned into a little cross.

Turns out under Gnome you should go to Applications -> Desktop Preferences -> Windows -> Movement Key and change it to something other than the control key; then log out and back in.

The magic of netrc

If you're like me, you probably use ftp every day but have never read the man page. I wish I had though, because it reveals the netrc(5) file for automating logins, and even simple macros.

machine ftp.machine.com
    login ianw
    password password
    macdef upload
           cd /public_html/a/dir
           lcd /var/www/a/dir
           mput blah.txt

machine ftp.machine2.com ...

Actually, I only found out about this because the Python ftplib code mentions it. So you can do something like

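from ftplib import FTP
from netrc import netrc

# authenticators() returns a (login, account, password) tuple for the host,
# or None if ~/.netrc has no matching entry and no default entry
(username, account, password) = netrc().authenticators('ftp.machine.org')

ftp = FTP('ftp.machine.org')
ftp.set_debuglevel(2)   # log every line sent and received on the control connection
ftp.login(username, password)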
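The same module also exposes the macros. The sketch below writes the example entry from above to a temporary file (so it doesn't touch a real ~/.netrc) and reads it back; the temporary file and the variable names are just for illustration. Note the blank line that ends the macdef, which the parser needs.

```python
import os
import tempfile
from netrc import netrc

# The example entry from above; the trailing blank line terminates the macro.
SAMPLE = """\
machine ftp.machine.com
    login ianw
    password password
    macdef upload
           cd /public_html/a/dir
           lcd /var/www/a/dir
           mput blah.txt

"""

# Write the sample to a throwaway file and parse it.
with tempfile.NamedTemporaryFile("w", suffix=".netrc", delete=False) as handle:
    handle.write(SAMPLE)
    path = handle.name

try:
    parsed = netrc(path)
finally:
    os.remove(path)

login, account, password = parsed.authenticators("ftp.machine.com")
# Macro bodies are stored as raw lines, so strip the indentation and newlines.
macro_lines = [line.strip() for line in parsed.macros["upload"]]

print(login, password)
print(macro_lines)
```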

It's always nifty re-discovering something that has been around since the dawn of time_t!

Tracking an ABI change

Unfortunately this morning I got hit by a bug where an updated library broke an existing program.

The first thing I noticed was that if I rebuilt the program in question against the new library, everything worked again. This sort of thing points to a (probably unintentional) ABI change.

The source code was large, so I needed to try to zero in on what was happening a bit more. I figured if this was an ABI change, it should show up in the assembly. Thus I created a dump of both binary images with objdump --disassemble.

I then ran that through tkdiff to see where I stood. This showed up about 1500 differences, but looking at them they were mostly

@@ -2138 +2138 @@
-4000000000006dd0:      01 98 81 03 38 24       [MII]       addl r51=7264,r1
+4000000000006dd0:      01 98 01 03 28 24       [MII]       addl r51=5184,r1

As you may know, on IA64 r1 is defined as the gp or global pointer register. Functions aren't just functions on IA64; they have a function descriptor which contains both the function address and a value for the global pointer. The add instruction can take up to a 22 bit operand, so by adding to the global pointer you can offset into a region of 4MB of memory (2^22 bytes = 4MB) directly. When gcc builds your program, it sets r1 to point to the .got section of your binary. Now between the start of the binary and the GOT there is a whole bunch of stuff, notably unwind info, which might push the offsets out. So we can pretty much ignore all of these when looking for the root of our problem.
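To make the arithmetic concrete, here is a small Python sketch using only the numbers above. The signed range is an assumption on my part (that the 22 bit operand is sign-extended, giving 2MB either side of gp); the variable names are made up.

```python
IMM_BITS = 22

# Total window reachable with a 22 bit operand added to gp.
reach_bytes = 1 << IMM_BITS
reach_mb = reach_bytes // (1024 * 1024)

# Assumption: the operand is signed, so the window is centred on gp.
lowest_offset = -(1 << (IMM_BITS - 1))
highest_offset = (1 << (IMM_BITS - 1)) - 1

# The two gp-relative offsets from the diff shown earlier.
minus_side_offset = 7264
plus_side_offset = 5184
shift = minus_side_offset - plus_side_offset

print(reach_bytes, reach_mb)
print(lowest_offset, highest_offset)
print(shift)
```

The last number is just how far that one entry moved relative to gp between the two builds, which is why such lines are noise rather than signal.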

So a bit more sed and grep gives you a much reduced list of changes, and one in particular jumps out ...

-4000000000051a2c:      04 00 10 90                         mov r38=512
+4000000000051a2c:      24 00 08 90                         mov r38=258
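I did the filtering with sed and grep, but the idea can be sketched in Python too. This is not the exact pipeline I used; it assumes both disassembly dumps line up line-for-line (same addresses, same instruction count), and the function name and regular expression are my own.

```python
import re

# An addl whose base register is r1 (gp); \b stops this matching r10, r12, ...
GP_RELATIVE = re.compile(r"\baddl\s+r\d+=-?\d+,r1\b")


def interesting_changes(old_lines, new_lines):
    """Return (old, new) pairs that differ, ignoring gp-relative addl lines."""
    changes = []
    for old, new in zip(old_lines, new_lines):
        if old == new:
            continue
        if GP_RELATIVE.search(old) and GP_RELATIVE.search(new):
            continue  # only the offset from gp moved; not what we are after
        changes.append((old, new))
    return changes


old_dump = [
    "4000000000006dd0:      01 98 81 03 38 24       [MII]       addl r51=7264,r1",
    "4000000000051a2c:      04 00 10 90                         mov r38=512",
]
new_dump = [
    "4000000000006dd0:      01 98 01 03 28 24       [MII]       addl r51=5184,r1",
    "4000000000051a2c:      24 00 08 90                         mov r38=258",
]

changes = interesting_changes(old_dump, new_dump)
for old, new in changes:
    print("-" + old)
    print("+" + new)
```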

This is where the very handy addr2line comes into play. Running that over the binary gives us

ianw@lime:~/tmp/openssh-3.8.1p1/build-deb$ addr2line --exe ./ssh 4000000000051a2c
../../openbsd-compat/bsd-arc4random.c:60

Peeking at that code

static RC4_KEY rc4;

void arc4random_stir(void)
{
        unsigned char rand_buf[SEED_SIZE];

60-->memset(&rc4, 0, sizeof(rc4));
        if (RAND_bytes(rand_buf, sizeof(rand_buf)) <= 0

 ... blah blah ...

This looks a lot like the sizeof(RC4_KEY) has changed on us. If our library has a different idea about the size of things than we do, it's sure to be a recipe for disaster. A little test program confirms the hypothesis.

#include <stdio.h>
#include "openssl/rc4.h"

int main(void)
{
        /* sizeof yields a size_t, so print it with %zu */
        printf("%zu\n", sizeof(RC4_KEY));
        return 0;
}

-- 0.9.7e-3 --
ianw@lime:~/tmp$ ./test
258

-- 0.9.7g-1 --
ianw@lime:~/tmp$ ./test
512

Of course, the "what" is the easy bit. Finding out why the size is different is left as an exercise, and a reason why your projects should always keep a ChangeLog in excruciating detail.
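To illustrate how a small header change can move sizeof like this, here is a Python ctypes sketch. The two layouts are purely hypothetical and are not claimed to be what changed in OpenSSL; they only show that changing a field type changes the size that gets baked into every caller at compile time.

```python
import ctypes


class OldKey(ctypes.Structure):
    # Hypothetical layout: two indices plus a 256 entry table of bytes.
    _fields_ = [
        ("x", ctypes.c_ubyte),
        ("y", ctypes.c_ubyte),
        ("data", ctypes.c_ubyte * 256),
    ]


class NewKey(ctypes.Structure):
    # Same fields, but the element type is now 16 bits wide (also hypothetical).
    _fields_ = [
        ("x", ctypes.c_ushort),
        ("y", ctypes.c_ushort),
        ("data", ctypes.c_ushort * 256),
    ]


old_size = ctypes.sizeof(OldKey)
new_size = ctypes.sizeof(NewKey)

# A program built against one header reserves and clears that many bytes,
# while a library built against the other reads and writes the other amount.
print(old_size, new_size, new_size - old_size)
```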

Rotate the Y-axis label in gnuplot

As far as I can tell, there is no good way to rotate the Y-axis label in gnuplot. Making your own label is a PITA and doesn't look right; if anyone knows how to get the position exactly right I'll add it to this post.

The best I can come up with is to rotate the output in the final postscript.

If you want to do this, you simply add a translate and rotate to your output postscript as per below. Add it on the line just before the one that starts with your Y-axis label string, and add a matching grestore after the grestore that follows the label.

--- ./contour.eps       2005-05-11 12:12:49.700892853 +1000
+++ contour-ok.eps      2005-05-11 12:12:42.404017942 +1000
@@ -23144,8 +23144,10 @@
 LTb
 6426 2620 M
 gsave 0 setgray
+currentpoint gsave translate 90 rotate 0 0 M
 (Latency \(us\)) Cshow
 grestore
+grestore
 1.000 UP
 grestore % colour palette end
 stroke
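If you regenerate the plot often, the hand edit gets old. Here is a minimal Python sketch of the same patch, working on a list of lines. The function name is made up, and it assumes the label sits on its own line followed by a bare grestore, as in the diff above; other gnuplot versions may lay the postscript out differently.

```python
ROTATE = "currentpoint gsave translate 90 rotate 0 0 M"


def rotate_label(eps_lines, label):
    """Return a copy of eps_lines with the first line containing label rotated."""
    out = []
    state = "find_label"
    for line in eps_lines:
        if state == "find_label" and label in line:
            out.append(ROTATE)          # move the origin to the label and rotate
            out.append(line)
            state = "find_grestore"
        elif state == "find_grestore" and line.strip() == "grestore":
            out.append(line)
            out.append("grestore")      # balances the gsave added above
            state = "done"
        else:
            out.append(line)
    return out


before = [
    "LTb",
    "6426 2620 M",
    "gsave 0 setgray",
    r"(Latency \(us\)) Cshow",
    "grestore",
    "1.000 UP",
]

after = rotate_label(before, r"(Latency \(us\)) Cshow")
print("\n".join(after))
```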

Conversion to a bitmap is pretty crappy, but might work for some situations, like say, posting here.

Latency Contour

That's a graph of how packet latency varies with different applied loads (load applied and data gathered via a distributed benchmark we wrote). That thin solid line in the middle shows that by and large, for this particular test latency doesn't vary that much with applied load. The secondary trend is latencies going up overall as the load gets higher.

Xnest for remote Gnome session

One fairly nice way to get a remote Gnome session with a minimum of fuss is to use Xnest. This just makes a little X server in a window.

Firstly, set up xauth to allow you through

ianw@lime:~$ echo $DISPLAY
:0.0
ianw@lime:~$ xauth list
localhost.localdomain:0  MIT-MAGIC-COOKIE-1  bd72c2....6c8ab8
lime/unix:0  MIT-MAGIC-COOKIE-1  9a5c6...648d4bffd
localhost.localdomain/unix:0  MIT-MAGIC-COOKIE-1  9a5c6b555f...bffd

Copy the cookie that corresponds to your current display to a new one for display :1

ianw@lime:~$ xauth add :1 MIT-MAGIC-COOKIE-1  bd72c2....6c8ab8

Then start up Xnest on display :1

ianw@lime:~$ Xnest -auth .Xauthority :1 &

Then fire up an xterm that ssh's to the remote box and starts up something like gnome-session

ianw@lime:~$ xterm -display :1 -e ssh tutti gnome-session

and hopefully it will look something like this!

Think outside the box

There is a story that goes around UNSW about how a large computer was having constant single bit errors. Every part was replaced but it still kept happening. Eventually, someone realised that the radar tower at the nearby Sydney airport used to sweep directly over the top floor of that building, inducing the errors; a Faraday cage was installed and all was well.

Today I read about another interesting problem that came from outside the box; Rob Fowler from Rice University wrote:

I've recently had discussions with several vendors who have mentioned similar magnitudes of disk performance degradation due to the coupling of vibration between cooling fans and disks. This can dramatically increase seek time by keeping the arm from settling. In one case, upgrading a chassis fan caused disk throughput to go down by a factor of 16. The solution is careful attention to vibration damping in mountings.

When you have a problem, think outside the box, literally!

Changing the default emacs font

In your .Xdefaults file add a line

emacs*font:6x12

You can replace the 6x12 with pretty much anything you see in the menu that pops up when you shift-left-click in an emacs window. Make sure you merge your changes with xrdb -merge .Xdefaults.

The problem with adding something like (set-face-font 'default "6x10") to .emacs is that if you start up emacs with -nw for no window mode you'll get errors about not being able to set up the font.