No one got fired for choosing IBM…

… but I should have been, and actually I might well be, deservedly so. Three of our four IBM machines (X335 and X336 models) have crashed horribly over the past few months. We are still recovering from two X335s frying their SCSI controllers, which in turn b0rked the disks beyond repair, and I was stupid enough to buy a new X336, which is where our main server currently runs. The sucker, barely two months old, has been hit by scattered hangs from the beginning, then started crashing every night, and is now kernel panicking every few hours, muttering about something going wrong during tcp_retransmit.

Thank god (and Emilio, who reminded me about it… my sysadmin skills are so rusty) we have the “panic=30” kernel option, which should at least make this sorry heap of crap come back to life by itself when stuff hits the fan. However, I’m curious to see when we’re going to have the immense privilege of IBM support contacting us, after we called them last Thursday. So far, no sign of intelligent life from the other side of the pond apart from a “we’ll call you back shortly” from trained monkeys, so it’s time to wrestle with backups and service migrations.
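For the record, “panic=30” on the kernel command line tells Linux to reboot 30 seconds after a panic instead of sitting there dead; the running value is also visible in /proc/sys/kernel/panic. Below is a minimal Python sketch for checking that the option really made it onto a box. The function names are made up for illustration, and it assumes a Linux machine with /proc mounted.

```python
from pathlib import Path
from typing import Optional


def panic_timeout_from_cmdline(cmdline: str) -> Optional[int]:
    """Return N from the last 'panic=N' token of a kernel command line, or None."""
    timeout = None
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "panic" and sep:
            try:
                timeout = int(value)
            except ValueError:
                pass  # ignore a malformed value such as 'panic=soon'
    return timeout


def current_panic_timeout(proc_file: str = "/proc/sys/kernel/panic") -> Optional[int]:
    """Return the timeout the running kernel is using, or None if unreadable."""
    try:
        return int(Path(proc_file).read_text().strip())
    except (OSError, ValueError):
        return None
```

Feeding it the contents of /proc/cmdline, e.g. `panic_timeout_from_cmdline(Path("/proc/cmdline").read_text())`, should give 30 on a machine booted with the option. A running value of 0 means the kernel will wait forever after a panic, which is exactly what you do not want on a server nobody is sitting next to.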

In any case, rest assured that I’m not going to buy even an IBM mouse pad for a long while. Dear Lazyweb, any suggestions for reliable 1U, 1/2CPU servers with decent support?

Update: the part got replaced today. It took IBM only 8 days and 3 hours to solve the problem. They say the delay was caused by our dealer failing to communicate the correct serial number. 8 days and 3 hours. That’s 195 hours, or 11,700 minutes. Am I supposed to believe such crap? Next time I bet the tooth fairy will be to blame.

Comments

4 thoughts on “No one got fired for choosing IBM…”

  1. Theo, thanks for the suggestion. The machines are running Ubuntu Hoary, no RH over here. Will try your kernel parameter, but ultimately the problem is IBM support being much less responsive than it should be.

  2. Just curious, what OS/distro are you running on these machines? I have many IBM machines (x335, x326, x445, etc.) that all work terrifically. However, when we tried upgrading from our standard RHEL 3 to RHEL 4, we started hitting a bunch of bugs that caused the machines to crash randomly (uptimes of 2 minutes to several days), including a TCP retransmit issue. If you’re having a similar issue, the RH workaround was to put “net.ipv4.tcp_retrans_collapse = 0” in /etc/sysctl.conf and do a “sysctl -p” to have it take effect. I’d also make sure the BIOS, firmware, etc. are up to date. You can find the UpdateXpress CD at http://www-1.ibm.com/support/docview.wss?rs=1201&uid=psg1MIGR-53046&loc=en_US&cs=utf-8&lang=en

    Don’t know if that’ll help, but … :)
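
The workaround suggested in the comment above boils down to one line in /etc/sysctl.conf followed by “sysctl -p”. As a rough sketch only, here is one way to add or update that line without ending up with duplicates. The helper name `set_sysctl` is made up for illustration, and whether the setting actually cures the panic described in the post is not something the post confirms.

```python
def set_sysctl(conf_text: str, key: str, value: str) -> str:
    """Return sysctl.conf text with 'key = value' set exactly once."""
    wanted = f"{key} = {value}"
    out = []
    replaced = False
    for line in conf_text.splitlines():
        stripped = line.strip()
        is_comment = stripped.startswith("#") or stripped.startswith(";")
        name = stripped.partition("=")[0].strip()
        if not is_comment and "=" in stripped and name == key:
            if not replaced:
                out.append(wanted)  # rewrite the first matching entry in place
                replaced = True
            continue  # drop any later duplicate of the same key
        out.append(line)
    if not replaced:
        out.append(wanted)  # key was absent, so append it at the end
    return "\n".join(out) + "\n"
```

To use it, read /etc/sysctl.conf, pass the text through `set_sysctl(text, "net.ipv4.tcp_retrans_collapse", "0")`, write the result back as root, then run “sysctl -p” so the kernel picks it up. Running the helper twice gives the same file, so it is safe to put in a provisioning script.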
