Computer Troubleshooting

Let's begin with a proactive approach to file system maintenance. What steps should an administrator take to help prevent file system problems from happening in the first place? Here are my seven golden rules on the subject, in no particular order:

1. Upgrade your servers to Windows Server 2003. There's real value in doing this as far as disk maintenance is concerned. For example:

* The chkdsk command in Windows Server 2003 runs a lot faster than the Windows 2000 version of this utility, plus it can fix things like a corrupt Master File Table (MFT) that the previous version of the utility would choke on.
* Powerful new command-line tools like DiskPart.exe, Fsutil.exe and Defrag.exe give you more flexibility for managing disks from the command line instead of the GUI. These tools can be scripted to automate the disk management tasks you need to perform on a regular basis.
* The new Automated System Recovery (ASR) feature greatly simplifies the task of restoring your system/boot volume in the event of catastrophic disk failure.

2. Use hardware redundancy. RAID 1 disk mirroring lets you recover from catastrophic system volume failure with zero downtime, while RAID 5 is a great way of protecting your data volumes. Windows servers include built-in support for software RAID, but you'll get better performance and true hot-swap redundancy by spending more money on a hardware RAID controller for your system instead. Don't forget to keep a few spare drives handy so you can swap them in during an emergency; redundancy is useless if you don't have the replacement hardware on hand. Note that if you do choose to go with the software RAID provided by Windows, mirroring your boot and system volumes requires that these volumes be one and the same, i.e., a single volume is both your boot volume (contains operating system files) and your system volume (contains hardware-specific boot files).

3. Use a good antivirus program. Viruses can be nasty, and one of the things they can do when they infect a machine is to corrupt the Master Boot Record (MBR) and other critical portions of your hard drives. Not only should you have AV installed on your servers, you should also avoid risky behaviors such as running scripts from untrusted sources, browsing the web, and so on. These are just the kinds of behavior that can lead to infecting your system, so avoid doing things like this on your production servers.

4. Defragment your file systems on a regular basis. This is especially important on servers that handle a high number of transactional operations, because their file systems can quickly become fragmented, dragging down the performance of applications running on the server. To perform a successful defrag you should have at least 15% free space left on your disk, so don't let critical system or data disks fill up too much or they'll be harder to maintain. The new command-line Defrag.exe tool in Windows Server 2003 is useful here, since you can use the Schtasks.exe command to schedule it to run regularly during off-hours instead of having to defrag manually or buy a third-party defrag tool.

5. Run chkdsk /r on a regular basis. This command finds bad sectors on your disk and tries to fix them by recovering data from them and moving it elsewhere. You can run this command either from a command-prompt window or from the Recovery Console if you can't boot your system normally. Remember that when you try to run chkdsk.exe on your system or boot volume, Windows can't lock the volume while it's in use, so it schedules autochk.exe (the boot-time version of chkdsk.exe) to run at your next reboot instead. This means you'll need to schedule downtime for your server when you perform this kind of maintenance so that autochk.exe can run.

6. Check your event logs regularly for any disk-related events. Windows sometimes determines on its own that a disk is "dirty," i.e., that there are file system errors present on it. In that case, Windows automatically schedules autochk.exe to run at the next reboot, and it also writes an event to the Application log using either the source name "Chkdsk" or "Winlogon". So filter your Application log to view these kinds of events on a regular basis, or collect them using Microsoft Operations Manager (MOM) or whatever other systems management tool you use on your network.

7. Back up all your volumes regularly. As a last recourse in the event of a disaster, having working backups of both your system/boot volume and data volumes is critical. ASR in Windows Server 2003 makes backing up the boot/system volume easier, while backing up your data volumes can be done using the Windows Backup (ntbackup.exe) tool or any other backup tool such as one from a third-party vendor. Whatever way you choose to back up your system, do it regularly and verify your backups to ensure you can recover your system using them.

8. (the Platinum rule) If your disk starts to make funny sounds, don't ignore them. Do something. Disk failure is often preceded by funny sounds emanating from your computer. Clicking, scraping, screeching, or similar sounds mean trouble, so when you hear them it's time to make sure you've got a recent backup and a spare disk handy just in case. It's also time to check your event logs, run chkdsk /r, and use other maintenance and troubleshooting tools to check the health of your disks. Don't ignore these funny sounds!
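
To make rule 4 a little more concrete, here is a minimal Python sketch of the two pieces involved: checking that a volume still has the recommended 15% of free space, and assembling a Schtasks.exe command line that runs Defrag.exe weekly during off-hours. The function names, the task name, and the Sunday 02:00 schedule are my own illustrative assumptions, not anything prescribed by Windows, and you should confirm the switches against `schtasks /create /?` on your own server before relying on them.

```python
import shutil

# The 15% free-space guideline for defragmenting, taken from rule 4.
MIN_FREE_FRACTION = 0.15


def has_defrag_headroom(total_bytes, free_bytes, minimum=MIN_FREE_FRACTION):
    """Return True if free space is at least `minimum` of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes >= minimum


def volume_has_headroom(path):
    """Check the volume that contains `path` (for example "C:\\")."""
    usage = shutil.disk_usage(path)
    return has_defrag_headroom(usage.total, usage.free)


def build_defrag_task(volume, day="SUN", start="02:00:00"):
    """Build (but do not run) a schtasks command that defrags `volume` weekly.

    Hypothetical defaults: the task name and the Sunday 02:00 start time are
    placeholders. The HH:MM:SS start-time format is an assumption based on
    the Windows Server 2003 version of schtasks.
    """
    return [
        "schtasks", "/create",
        "/tn", f"Defrag {volume}",      # task name (illustrative)
        "/tr", f"defrag.exe {volume}",  # program the task runs
        "/sc", "weekly",                # schedule type
        "/d", day,                      # day of the week
        "/st", start,                   # start time
        "/ru", "SYSTEM",                # run under the System account
    ]
```

On the server itself, you could call `volume_has_headroom("C:\\")` first and, if it returns True, pass the list from `build_defrag_task("C:")` to `subprocess.run` to register the task. Keeping the command as a list rather than one long string leaves the quoting of the embedded `defrag.exe C:` argument to Python.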

Seven Golden Rules for Disk Maintenance