My file server is simply a Linux box running Samba and NFS servers, plus password-protected HTTPS so I can grab any files I need from remote locations as long as there is internet connectivity. My files are stored on external USB drives: the master, which is the drive I actively use, and a second drive that serves as the backup. Every night a cron job runs a script (backup_files.sh) which uses rsync to mirror the master onto the backup. In addition, the script removes any junk files that may have been left behind on the drive, such as backup files from vi, and metadata that my Mac likes to spew all over the place (which I can't stand!). The script also keeps a log, which is important to check every so often to make sure everything is actually working. Before doing anything, the script makes sure both drives are mounted, so nothing bad happens in the event one of the drives is missing (like the master drive crashing, and the backup script erasing everything on the backup to "synchronize" them). I will describe how this works later.
Now simply having a drive and its backup is not enough, in my opinion. If one drive fails, you are left with only one good copy of everything. If there is a fire, both drives are gone. So every month or so I swap the backup drive with another, and keep the unused drive at a remote location. Now my server and the entire building it is in can be destroyed with everything in it, but at least my files will be safe! In addition to the backup I keep on my home server, I also back up all my files to my computer at work with a separate script (backup_files_remotely.sh). This backup also runs nightly, and executes after the local backup has completed. The script requires an RSA key pair, with the private key on my home server and the public key in $HOME/.ssh/authorized_keys on my work server.
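Setting up a key pair like the one described above might look roughly like this. It is a sketch under a few assumptions: the key path matches the SSH_KEY setting used in backup_files_remotely.sh, the key gets an empty passphrase so that cron can use it unattended (the original setup may differ), and the function name setup_backup_key is made up for illustration.

```shell
# Hypothetical helper: create an RSA key pair and install the public half
# on the remote machine. Meant to be run once, by hand, on the home server.
setup_backup_key() {
    key_file=$1      # private key path, e.g. /home/nick/.ssh/rsa_key
    remote=$2        # user@host, e.g. nick@remote.server_address.net

    # -N "" gives the key an empty passphrase (an assumption, see above)
    ssh-keygen -t rsa -b 4096 -N "" -f "$key_file" || return 1

    # Appends $key_file.pub to $HOME/.ssh/authorized_keys on the remote side
    ssh-copy-id -i "$key_file.pub" "$remote"
}

# Example call (not executed here):
# setup_backup_key /home/nick/.ssh/rsa_key nick@remote.server_address.net
```

Afterwards, `ssh -i /home/nick/.ssh/rsa_key nick@remote.server_address.net` should log in without asking for a password.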
Below is the main backup script on my home server. Before running a script like this one for the first time, it is useful to pass the -n option to rsync to do a dry run. This way rsync will tell you exactly what it plans on doing, which will save you from accidentally removing files with a broken script. The $@ argument after rsync allows any additional arguments given at the command line to be passed directly to rsync (such as -n to test the script before executing it for real) without having to edit the script.
| backup_files.sh |
#!/bin/bash
FILES_DIR="/home/nick/files/"
BACKUP_FILES_DIR="/home/nick/backup_files/"
EXCLUDE="lost+found .identity"
LOG_FILE="/home/nick/logs/backup_files_log.txt"
KEEP_LOG="1" # set to 0 to disable, 1 to keep a running log, 2 to delete the log and record only current session
TMP_FILE="/home/nick/.backup_files_running"
if [ ! -e $TMP_FILE ]; then
touch $TMP_FILE
EXCLUDED=""
for i in $EXCLUDE; do
EXCLUDED="$EXCLUDED --exclude=$i";
done
if [[ $KEEP_LOG -eq 1 || $KEEP_LOG -eq 2 ]]; then
if [ $KEEP_LOG -eq 2 ]; then
rm -f $LOG_FILE
fi
if [ -e $FILES_DIR/.identity ] && [ -e $BACKUP_FILES_DIR/.identity ]; then
date | tee -a $LOG_FILE
echo "" | tee -a $LOG_FILE
echo "Removing metadata" | tee -a $LOG_FILE
echo "Searching for .DS_Store Macintosh metadata" | tee -a $LOG_FILE
find $FILES_DIR -name '.DS_Store' -exec rm -fv {} \; | tee -a $LOG_FILE
echo "Searching for ._* Macintosh metadata" | tee -a $LOG_FILE
find $FILES_DIR -name '._*' -exec rm -fv {} \; | tee -a $LOG_FILE
echo "Searching for *~ vi backups" | tee -a $LOG_FILE
find $FILES_DIR -name '*~' -exec rm -fv {} \; | tee -a $LOG_FILE
echo "Removing Mac TemporaryItems" | tee -a $LOG_FILE
if [ -e $FILES_DIR/.TemporaryItems/ ]; then
rm -rfv $FILES_DIR/.TemporaryItems/ | tee -a $LOG_FILE
fi
ID_FILES=`cat $FILES_DIR/.identity`
ID_BACKUP_FILES=`cat $BACKUP_FILES_DIR/.identity`
echo "" | tee -a $LOG_FILE
echo "Starting rsync backup, from" $ID_FILES "to" $ID_BACKUP_FILES | tee -a $LOG_FILE
rsync $EXCLUDED --delete-after -av "$@" $FILES_DIR $BACKUP_FILES_DIR | tee -a $LOG_FILE
# $? would be the exit status of tee here; PIPESTATUS[0] is rsync's
ERROR=${PIPESTATUS[0]}
echo "" | tee -a $LOG_FILE
date | tee -a $LOG_FILE
echo "--------------------------------------------------------------------------------" | tee -a $LOG_FILE
rm -f $TMP_FILE
exit $ERROR
else
date | tee -a $LOG_FILE
echo "" | tee -a $LOG_FILE
echo "Drives are not mounted (or no .identity file exists on drive)" | tee -a $LOG_FILE
if [ ! -e $FILES_DIR/.identity ]; then
echo $FILES_DIR "is not mounted" | tee -a $LOG_FILE
fi
if [ ! -e $BACKUP_FILES_DIR/.identity ]; then
echo $BACKUP_FILES_DIR "is not mounted" | tee -a $LOG_FILE
fi
echo "" | tee -a $LOG_FILE
echo "--------------------------------------------------------------------------------" | tee -a $LOG_FILE
rm -f $TMP_FILE
exit 3
fi
else
if [ -e $FILES_DIR/.identity ] && [ -e $BACKUP_FILES_DIR/.identity ]; then
date
echo ""
echo "Removing metadata"
echo "Searching for .DS_Store Macintosh metadata"
find $FILES_DIR -name '.DS_Store' -exec rm -fv {} \;
echo "Searching for ._* Macintosh metadata"
find $FILES_DIR -name '._*' -exec rm -fv {} \;
echo "Searching for *~ vi backups"
find $FILES_DIR -name '*~' -exec rm -fv {} \;
echo "Removing Mac TemporaryItems"
if [ -e $FILES_DIR/.TemporaryItems/ ]; then
rm -rfv $FILES_DIR/.TemporaryItems/
fi
ID_FILES=`cat $FILES_DIR/.identity`
ID_BACKUP_FILES=`cat $BACKUP_FILES_DIR/.identity`
echo ""
echo "Starting rsync backup, from" $ID_FILES "to" $ID_BACKUP_FILES
rsync $EXCLUDED --delete-after -av "$@" $FILES_DIR $BACKUP_FILES_DIR
ERROR=$?
echo ""
date
echo "--------------------------------------------------------------------------------"
rm -f $TMP_FILE
exit $ERROR
else
date
echo ""
echo "Drives are not mounted (or no .identity file exists on drive)"
if [ ! -e $FILES_DIR/.identity ]; then
echo $FILES_DIR "is not mounted"
fi
if [ ! -e $BACKUP_FILES_DIR/.identity ]; then
echo $BACKUP_FILES_DIR "is not mounted"
fi
echo ""
echo "--------------------------------------------------------------------------------"
rm -f $TMP_FILE
exit 3
fi
fi
else
echo "Backup is already running"
exit 2
fi
|
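To see how extra command-line options reach rsync, here is a small stand-in for the rsync line of the script. It only prints the command it would run, so it is safe to play with. The function name show_rsync_command is invented for this sketch; the directories are the ones from backup_files.sh, and the exclude options are left out for brevity.

```shell
# Stand-in for the rsync call in backup_files.sh: prints the command
# instead of running it. "$@" expands to whatever options were passed in.
show_rsync_command() {
    echo rsync --delete-after -av "$@" /home/nick/files/ /home/nick/backup_files/
}

show_rsync_command          # no extra options: the normal nightly run
show_rsync_command -n       # dry run: rsync only reports what it would do
show_rsync_command -n -c    # dry run that also compares checksums
```

With the real script, the equivalent calls would be `backup_files.sh`, `backup_files.sh -n` and `backup_files.sh -n -c`.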
The problem with using USB drives is that the same drive may be mapped to a different device after a reboot or removal. The way I decided to keep all my drives in order is to tag each one with a file named .identity in the root of the drive, containing the name of that drive. This way, upon bootup, I can have a script read the .identity file and decide where that drive needs to be mounted. If a device doesn't have a .identity file, I consider that drive not present (this is how the backup script above checks the drives). Below is the script (mount_drives.sh) I use on my home server for mounting my master drive and one of the backups (note that I have four backup drives all mounting to the same location, but at any given time only one of them will ever be on the system).
| mount_drives.sh |
#!/bin/sh
DRIVE_1_DIR="/home/nick/files/"
DRIVE_1_ID="files0"
DRIVE_2_DIR="/home/nick/backup_files/"
DRIVE_2_ID="files1"
DRIVE_3_DIR="/home/nick/backup_files/"
DRIVE_3_ID="files2"
DRIVE_4_DIR="/home/nick/backup_files/"
DRIVE_4_ID="files3"
DRIVE_5_DIR="/home/nick/backup_files/"
DRIVE_5_ID="files4"
TEMP_MOUNT_DIR="/mnt/"
DRIVES="/dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1"
mount_drives_start() {
for DRIVE in $DRIVES; do
mount -t ext3 $DRIVE $TEMP_MOUNT_DIR
if [ -e $TEMP_MOUNT_DIR/.identity ]; then
ID=`cat $TEMP_MOUNT_DIR/.identity`
echo "The ID for" $DRIVE "is" $ID
umount $TEMP_MOUNT_DIR
if [ "$ID" = "$DRIVE_1_ID" ]; then
if [ ! -e $DRIVE_1_DIR/.identity ]; then
echo "Mounting" $DRIVE "to" $DRIVE_1_DIR
mount -t ext3 $DRIVE $DRIVE_1_DIR
else
echo "Something is already mounted to" $DRIVE_1_DIR
echo $DRIVE "will not be mounted"
fi
elif [ "$ID" = "$DRIVE_2_ID" ]; then
if [ ! -e $DRIVE_2_DIR/.identity ]; then
echo "Mounting" $DRIVE "to" $DRIVE_2_DIR
mount -t ext3 $DRIVE $DRIVE_2_DIR
else
echo "Something is already mounted to" $DRIVE_2_DIR
echo $DRIVE "will not be mounted"
fi
elif [ "$ID" = "$DRIVE_3_ID" ]; then
if [ ! -e $DRIVE_3_DIR/.identity ]; then
echo "Mounting" $DRIVE "to" $DRIVE_3_DIR
mount -t ext3 $DRIVE $DRIVE_3_DIR
else
echo "Something is already mounted to" $DRIVE_3_DIR
echo $DRIVE "will not be mounted"
fi
elif [ "$ID" = "$DRIVE_4_ID" ]; then
if [ ! -e $DRIVE_4_DIR/.identity ]; then
echo "Mounting" $DRIVE "to" $DRIVE_4_DIR
mount -t ext3 $DRIVE $DRIVE_4_DIR
else
echo "Something is already mounted to" $DRIVE_4_DIR
echo $DRIVE "will not be mounted"
fi
elif [ "$ID" = "$DRIVE_5_ID" ]; then
if [ ! -e $DRIVE_5_DIR/.identity ]; then
echo "Mounting" $DRIVE "to" $DRIVE_5_DIR
mount -t ext3 $DRIVE $DRIVE_5_DIR
else
echo "Something is already mounted to" $DRIVE_5_DIR
echo $DRIVE "will not be mounted"
fi
else
echo "The .identity file does not match any known drive"
echo $DRIVE "will not be mounted"
fi
else
umount $TEMP_MOUNT_DIR
echo $DRIVE "does not exist, or does not have a .identity file"
fi
done
}
mount_drives_stop() {
if [ -e $DRIVE_1_DIR/.identity ]; then
echo "Unmounting" $DRIVE_1_DIR
umount $DRIVE_1_DIR
else
echo "No drive mounted at" $DRIVE_1_DIR", or no .identity file is on drive"
fi
if [ -e $DRIVE_2_DIR/.identity ]; then
echo "Unmounting" $DRIVE_2_DIR
umount $DRIVE_2_DIR
else
echo "No drive mounted at" $DRIVE_2_DIR", or no .identity file is on drive"
fi
if [ -e $DRIVE_3_DIR/.identity ]; then
echo "Unmounting" $DRIVE_3_DIR
umount $DRIVE_3_DIR
else
echo "No drive mounted at" $DRIVE_3_DIR", or no .identity file is on drive"
fi
if [ -e $DRIVE_4_DIR/.identity ]; then
echo "Unmounting" $DRIVE_4_DIR
umount $DRIVE_4_DIR
else
echo "No drive mounted at" $DRIVE_4_DIR", or no .identity file is on drive"
fi
if [ -e $DRIVE_5_DIR/.identity ]; then
echo "Unmounting" $DRIVE_5_DIR
umount $DRIVE_5_DIR
else
echo "No drive mounted at" $DRIVE_5_DIR", or no .identity file is on drive"
fi
}
mount_drives_restart() {
mount_drives_stop
sleep 2
mount_drives_start
}
case "$1" in
'start')
mount_drives_start
;;
'stop')
mount_drives_stop
;;
'restart')
mount_drives_restart
;;
*)
mount_drives_start
esac
|
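The chain of if/elif branches in mount_drives_start is essentially a lookup from a drive's ID to its mount point. As a sketch only, that lookup could be written as one case statement. The function name mount_point_for_id is invented here, while the IDs and directories are the ones set at the top of mount_drives.sh.

```shell
# Print the mount point for a given .identity value,
# or return non-zero if the ID is not a known drive.
mount_point_for_id() {
    case "$1" in
        files0)
            echo "/home/nick/files/" ;;
        files1|files2|files3|files4)
            # all four backup drives share one mount point
            echo "/home/nick/backup_files/" ;;
        *)
            return 1 ;;
    esac
}

# Example: decide where the drive tagged "files2" belongs
if dir=$(mount_point_for_id files2); then
    echo "files2 would be mounted at $dir"
fi
```

The mounting loop would then need only one branch: look up the directory, check that nothing is mounted there yet, and mount.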
Below is the script I use for backing up my files to a remote computer (at work).
| backup_files_remotely.sh |
#!/bin/bash
# To backup multiple source dirs into the backup dir, separate dirs with a space and do not end dir paths with a slash
# To copy the contents of the source dir into the backup dir, end with a slash
USERNAME="nick"
SSH_KEY="/home/nick/.ssh/rsa_key"
SOURCE_DIR="/home/nick/files/"
BACKUP_DIR="remote.server_address.net:files/"
EXCLUDE="lost+found .identity"
LOG_FILE="/home/nick/logs/backup_files_remotely_log.txt"
KEEP_LOG="2" # set to 0 to disable, 1 to keep a running log, 2 to delete the log and record only current session
TMP_FILE="/home/nick/.backup_files_remotely_running"
EXCLUDED=""
for i in $EXCLUDE; do
EXCLUDED="$EXCLUDED --exclude=$i";
done
if [ ! -e $TMP_FILE ]; then
touch $TMP_FILE
if [[ $KEEP_LOG -eq 1 || $KEEP_LOG -eq 2 ]]; then
if [ $KEEP_LOG -eq 2 ]; then
rm -f $LOG_FILE
fi
date | tee -a $LOG_FILE
echo "" | tee -a $LOG_FILE
rsync -e "ssh -i $SSH_KEY" $EXCLUDED --delete-after -av "$@" $SOURCE_DIR $USERNAME@$BACKUP_DIR | tee -a $LOG_FILE
# $? would be the exit status of tee here; PIPESTATUS[0] is rsync's
ERROR=${PIPESTATUS[0]}
echo "" | tee -a $LOG_FILE
date | tee -a $LOG_FILE
echo "--------------------------------------------------------------------------------" | tee -a $LOG_FILE
else
date
echo ""
rsync -e "ssh -i $SSH_KEY" $EXCLUDED --delete-after -av "$@" $SOURCE_DIR $USERNAME@$BACKUP_DIR
ERROR=$?
date
echo ""
echo "--------------------------------------------------------------------------------"
fi
rm -f $TMP_FILE
exit $ERROR
else
echo "Backup is already running"
exit 2
fi
|
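One detail worth knowing when a command is piped through tee for logging: after a pipeline, $? holds the exit status of the last command (tee), not of the command being logged. Bash keeps the status of every stage in the PIPESTATUS array. The helper below is a hypothetical sketch of that idea, with `false` standing in for a failing rsync; run_logged is not part of the scripts above.

```shell
#!/bin/bash
# Hypothetical helper: run a command, append its output to the log,
# and return the command's own exit status rather than tee's.
LOG_FILE="${LOG_FILE:-/dev/null}"

run_logged() {
    "$@" | tee -a "$LOG_FILE"
    return "${PIPESTATUS[0]}"
}

# Plain pipeline: the failure of 'false' is hidden behind tee's success
false | tee -a "$LOG_FILE"
echo "\$? after the plain pipeline: $?"

# With the helper, the failure is reported
run_logged false || echo "run_logged reported status: $?"
```

The first message reports 0 and the second reports 1, so only the second form lets a cron wrapper or a later `exit $ERROR` notice that the backup failed.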
While rsync generally works well for keeping everything backed up, by default the only things it checks are each file's modification time and size. During a transfer, rsync verifies each file it copies with a whole-file checksum (128-bit MD4 in older versions, MD5 as of rsync 3.0) to make sure it copied correctly. I have noticed instances where this seemed to fail, although it could be that the data in one of the files mutated sometime between the rsync transfer and when I compared the files later on. As a rough estimate, in my own experience about one file differs for every 500GB copied. If a program modifies the contents of a file but keeps the modification time the same, and the file size happens to stay the same, rsync will not catch any difference between the original and updated files.
To get around this problem, we can force rsync to compare files by size and checksum with the -c switch. With this option, a checksum is generated for every file on the sending side, and on the receiving side only for files whose size matches the sender's. This results in a much slower backup, but at least it will catch some of the holes in the typical but much faster method of archiving. So every few weeks, or before I swap backup drives, I run the script above with the -c switch added to the rsync options. One could also schedule a cron job to run the above script with the -c switch every month. As long as the checksumming version of the backup script starts before the usual one, the nightly backup without checksumming will see that a backup is already running and exit.
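The blind spot of the default comparison can be reproduced without rsync at all. The sketch below imitates the size-and-time check with plain shell tools (it is an illustration, not rsync's own code) and shows a file whose contents changed while its size and timestamp stayed the same. The function name quick_check_equal is made up for this example.

```shell
# Imitate rsync's default quick check: same size and same modification time.
quick_check_equal() {
    [ "$(wc -c < "$1")" -eq "$(wc -c < "$2")" ] &&
        [ ! "$1" -nt "$2" ] && [ ! "$1" -ot "$2" ]
}

demo_dir=$(mktemp -d)
printf 'hello world\n' > "$demo_dir/master.txt"
cp -p "$demo_dir/master.txt" "$demo_dir/backup.txt"

# Change the master without changing its size, then put the old timestamp back
printf 'hello w0rld\n' > "$demo_dir/master.txt"
touch -r "$demo_dir/backup.txt" "$demo_dir/master.txt"

if quick_check_equal "$demo_dir/master.txt" "$demo_dir/backup.txt"; then
    echo "quick check: files look identical"
fi
if ! cmp -s "$demo_dir/master.txt" "$demo_dir/backup.txt"; then
    echo "content check: files differ"
fi
rm -r "$demo_dir"
```

Both messages are printed: the quick check is fooled, while a comparison of the actual contents is not. Running the backup as `backup_files.sh -c` asks rsync to do the slower, content-based comparison.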
Another option is to directly compare the files with diff. Twice a month I have a cron job run a script containing the command below. If a file differs it will let me know, and I will compare the files with another backup (like my backup at work, using md5sum to compare if the files are large), and recopy the file to the drive with the altered copy (and re-check that file with diff to make sure it took).
diff -rq /home/nick/files/ /home/nick/backup_files/ | tee -a /home/nick/backup_diff.txt
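When diff reports a mismatch, the next step is working out which copy is the good one. A minimal sketch of the md5sum comparison could look like this; the helper names md5_of and same_md5 are invented for illustration, and the remote path in the final comment is a placeholder.

```shell
# Print only the MD5 digest of a file
# (md5sum normally prints the digest followed by the file name).
md5_of() {
    md5sum < "$1" | cut -d ' ' -f 1
}

# Succeed if two files have the same MD5 digest.
same_md5() {
    [ "$(md5_of "$1")" = "$(md5_of "$2")" ]
}

# Example (not executed here): compare a suspect file on both local drives
# same_md5 /home/nick/files/some/file /home/nick/backup_files/some/file

# The digest of the copy at work can be fetched over ssh and compared by eye:
# ssh -i /home/nick/.ssh/rsa_key nick@remote.server_address.net md5sum files/some/file
```

Whichever local copy matches the remote digest is presumably the intact one, and can be copied over the other.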
Lastly we come to the computers I actually do work on. While I am actively working on files, I'll keep them in a backup directory on my desktop, which synchronizes itself to a directory on my server every 15 minutes. It is the same script as backup_files_remotely.sh, but set up to run on the computer I am using, with the private key on the local computer, and the public key in the server's $HOME/.ssh/authorized_keys.
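For the 15-minute schedule, a crontab entry along these lines would do. The script location is an assumption, since where the script lives on the desktop is not stated above.

```
# m    h  dom mon dow  command
*/15   *  *   *   *    /home/nick/bin/backup_files_remotely.sh
```

Because the script exits with "Backup is already running" while its lock file exists, a sync that takes longer than 15 minutes will not pile up on top of itself.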
So, for the most part I have a fully automated backup system. The only things I need to do are check the logs once in a while to make sure nothing broke, and swap out my backup drives.