May 24

Website Speed and Downtime

You may have noticed the website running a little slowly over the past few weeks, with some "Gateway Timeout" HTTP errors. The 15 minutes of downtime at 6pm this evening should have solved that issue.

nicegear runs on a KVM virtual machine (VM) on a dedicated server running Ubuntu, which we lease from SiteHost here in New Zealand. On the host server we were seeing load averages spike from 5 to 25, while on the VM the load was running normally at less than 1. At first I thought the reason for the slowdown was just high load from increased traffic to nicegear and other websites running on other VMs. This wasn't correct.

After some investigation it turns out that one of the disks in the main RAID volume has been failing. Normally you would see errors from either S.M.A.R.T. or other RAID monitoring services when you have a disk failing. In this instance I didn't spot the issue quickly as there were no errors, just very average write performance, which was partially masked by the RAID system.
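For anyone wanting to look for the same kind of silent slowdown, here is a rough sketch of one way to do it on Linux. It is not the exact process I followed. It reads the cumulative counters in /proc/diskstats and works out the average time per completed write for each device, so a disk that is much slower than its mirror partner stands out. The function names, the 5x threshold and the sample numbers are all made up for illustration.

```python
def write_latency_ms(diskstats_text):
    """Return {device: average milliseconds per completed write}.

    Expects text in the /proc/diskstats layout, where (counting from 0)
    field 2 is the device name, field 7 is writes completed and
    field 10 is total milliseconds spent writing.
    """
    latencies = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) < 11:
            continue  # skip blank or malformed lines
        name = fields[2]
        writes_completed = int(fields[7])
        ms_writing = int(fields[10])
        if writes_completed > 0:
            latencies[name] = ms_writing / writes_completed
    return latencies


def slow_disks(latencies, factor=5.0):
    """List devices whose write latency exceeds `factor` times the fastest one.

    The factor of 5 is an arbitrary illustrative threshold, not a standard.
    """
    if not latencies:
        return []
    best = min(latencies.values())
    if best <= 0:
        return []
    return sorted(name for name, ms in latencies.items() if ms > factor * best)


# Hypothetical sample: two mirrored disks, one of them writing very slowly.
SAMPLE = (
    "   8       0 sda 1000 0 8000 500 2000 0 16000 4000 0 0 0\n"
    "   8      16 sdb 1000 0 8000 500 2000 0 16000 90000 0 0 0\n"
)

if __name__ == "__main__":
    print(slow_disks(write_latency_ms(SAMPLE)))  # prints ['sdb']
```

On a real host you would pass in the contents of /proc/diskstats instead of the sample. The counters are totals since boot, so comparing two readings taken a minute apart gives a better picture of current behaviour, and partitions show up as separate rows alongside whole disks.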

After some serious debugging I took the step of failing the disk out of the main array and falling back to a single disk. Immediately after doing this the load average on the bare metal host dropped from 15 to under 0.5, which for a quad-core system is very healthy.
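To put those numbers in context, a load average only means something relative to the number of cores. On Linux it also counts processes stuck waiting on disk I/O, which is how a slow disk can push the load up without the CPUs being busy. A small sketch of the arithmetic follows; the function name is just for illustration.

```python
import os


def load_per_core(load_average, cores):
    """Divide a load average by the core count.

    Values well above 1.0 mean more work is queued (running or waiting
    on disk) than the cores can get through.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_average / cores


if __name__ == "__main__":
    # Figures from this post: a quad-core host, before and after
    # the failing disk was removed from the array.
    print(load_per_core(15, 4))   # prints 3.75
    print(load_per_core(0.5, 4))  # prints 0.125

    # The same check against the machine this runs on (Unix only).
    one_minute, _, _ = os.getloadavg()
    print(load_per_core(one_minute, os.cpu_count() or 1))
```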

SiteHost were kind enough to head down to the data centre this evening and replace the dodgy drive. The data is re-syncing across to the new drive at the moment and everything is back running at a nice speed.
