(Mostly) Automatic File Backups in Linux
My philosophy for general file storage (or at least what works for me) is to keep every file worth keeping on a centralized server, and to back that server up regularly. I never store files on my computers unless I am actively working on them (and even then, I usually have a cron job rsync them to the server every few minutes). Any one of my computers can crash and burn, and it will be no big deal (in fact one of my laptops completely died two days ago... No worry!). Even if my server dies I won't have much downtime! I describe my backup method in detail below, and hopefully some of the ideas will be useful to you. The scripts are fairly specific to my needs, but I think they are generic enough to be easily modified to suit others. I assume the reader knows enough to figure out what the scripts do on their own, but I am willing to help anyone who has trouble getting one of the scripts to work for their needs (just don't expect me to write a new script for you).

My file server is simply a Linux box running Samba and NFS servers, as well as password-protected HTTPS so I can grab any files I need from remote locations as long as there is internet connectivity. My files are stored on external USB drives: one is the master, which is the drive I actively use, and the second is the backup. Every night a cron job runs a script (backup_files.sh) which uses rsync to mirror the master onto the backup. In addition, the script removes any junk files that may have been left behind on the drive, such as backup files from vi, and metadata that my Mac likes to spew all over the place (which I can't stand!). The script also keeps a log, which is important to check every so often to make sure everything is actually working. Before doing anything, the script makes sure both drives are mounted, so nothing bad happens if one of the drives is missing (like the master drive crashing, and the backup script erasing everything on the backup to "synchronize" them). I will describe how this works later.

Now simply having a drive and its backup is not enough in my opinion. If one drive fails, then you are left with only one good copy of everything. If there is a fire, then both drives are gone. So, every month or so I swap the backup drive with another, and I keep the unused drive at a remote location. Now my server and the entire building it is in can be destroyed with everything in it, but at least my files will be safe! In addition to the backup I keep on my home server, I also back up all my files to my computer at work with a separate script (backup_files_remotely.sh). This backup also runs nightly, and executes after the local backup has completed. The script requires an RSA key pair, with the private key on my home server, and the public key in $HOME/.ssh/authorized_keys on my work server.
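
Setting up such a key pair is a one-time job. A minimal sketch, assuming OpenSSH's ssh-keygen is installed; the helper name make_backup_key and the key comment are made up for illustration, while the key path in the usage lines matches SSH_KEY in backup_files_remotely.sh. The key is created without a passphrase so that cron can use it unattended, which also means anyone who can read the private key file can log in to the other machine, so its permissions are kept tight.

```shell
#!/bin/bash
# Hypothetical helper: create a passphrase-less RSA key pair for unattended backups.
# $1 is the private key path; the public key is written to "$1.pub".
make_backup_key() {
  local key_file="$1"
  mkdir -p "$(dirname "$key_file")" || return 1
  # -N "" sets an empty passphrase, -q keeps ssh-keygen quiet
  ssh-keygen -q -t rsa -b 4096 -N "" -C "backup key" -f "$key_file" || return 1
  chmod 600 "$key_file"
}

# Typical use on the home server (not run here):
#   make_backup_key /home/nick/.ssh/rsa_key
#   ssh-copy-id -i /home/nick/.ssh/rsa_key.pub nick@remote.server_address.net
```

ssh-copy-id appends the public key to $HOME/.ssh/authorized_keys on the remote side; copying the .pub file over and appending it by hand does the same thing.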

Below is the main backup script on my home server. Before running a script like this for the first time, it is useful to pass the -n option to rsync to do a dry run. This way rsync will tell you exactly what it plans on doing, which will save you from accidentally removing files with a broken script. The $@ argument after rsync passes any additional command-line arguments directly to rsync (such as -n to test the script before executing it for real) without having to edit the script.

backup_files.sh
#!/bin/bash
FILES_DIR="/home/nick/files/"
BACKUP_FILES_DIR="/home/nick/backup_files/"
EXCLUDE="lost+found .identity"
LOG_FILE="/home/nick/logs/backup_files_log.txt"
KEEP_LOG="1" # set to 0 to disable, 1 to keep a running log, 2 to delete the log and record only current session
TMP_FILE="/home/nick/.backup_files_running"

if [ ! -e $TMP_FILE ]; then
  touch $TMP_FILE

  EXCLUDED=""
  for i in $EXCLUDE; do
    EXCLUDED="$EXCLUDED --exclude=$i";
  done

  if [[ $KEEP_LOG -eq 1 || $KEEP_LOG -eq 2 ]]; then
    if [ $KEEP_LOG -eq 2 ]; then
      rm -f $LOG_FILE
    fi
    if [ -e $FILES_DIR/.identity ] && [ -e $BACKUP_FILES_DIR/.identity ]; then
      date | tee -a $LOG_FILE

      echo "" | tee -a $LOG_FILE
      echo "Removing metadata" | tee -a $LOG_FILE
      echo "Searching for .DS_Store Macintosh metadata" | tee -a $LOG_FILE
      find $FILES_DIR -name '.DS_Store' -exec rm -fv {} \; | tee -a $LOG_FILE
      echo "Searching for ._* Macintosh metadata" | tee -a $LOG_FILE
      find $FILES_DIR -name '._*' -exec rm -fv {} \; | tee -a $LOG_FILE
      echo "Searching for *~ vi backups" | tee -a $LOG_FILE
      find $FILES_DIR -name '*~' -exec rm -fv {} \; | tee -a $LOG_FILE
      echo "Removing Mac TemporaryItems" | tee -a $LOG_FILE
      if [ -e $FILES_DIR/.TemporaryItems/ ]; then
        rm -rfv $FILES_DIR/.TemporaryItems/ | tee -a $LOG_FILE
      fi

      ID_FILES=`cat $FILES_DIR/.identity`
      ID_BACKUP_FILES=`cat $BACKUP_FILES_DIR/.identity`
      echo "" | tee -a $LOG_FILE
      echo "Starting rsync backup, from" $ID_FILES "to" $ID_BACKUP_FILES | tee -a $LOG_FILE
      rsync $EXCLUDED --delete-after -av "$@" $FILES_DIR $BACKUP_FILES_DIR | tee -a $LOG_FILE
      ERROR=${PIPESTATUS[0]} # exit status of rsync; $? here would be the status of tee

      echo "" | tee -a $LOG_FILE
      date | tee -a $LOG_FILE
      echo "--------------------------------------------------------------------------------" | tee -a $LOG_FILE

      rm -f $TMP_FILE
      exit $ERROR
    else
      date | tee -a $LOG_FILE

      echo "" | tee -a $LOG_FILE
      echo "Drives are not mounted (or no .identity file exists on drive)" | tee -a $LOG_FILE 
      if [ ! -e $FILES_DIR/.identity ]; then
        echo $FILES_DIR "is not mounted" | tee -a $LOG_FILE
      fi
      if [ ! -e $BACKUP_FILES_DIR/.identity ]; then
        echo $BACKUP_FILES_DIR "is not mounted" | tee -a $LOG_FILE
      fi

      echo "" | tee -a $LOG_FILE
      echo "--------------------------------------------------------------------------------" | tee -a $LOG_FILE

      rm -f $TMP_FILE
      exit 3
    fi
  else
    if [ -e $FILES_DIR/.identity ] && [ -e $BACKUP_FILES_DIR/.identity ]; then
      date

      echo ""
      echo "Removing metadata"
      echo "Searching for .DS_Store Macintosh metadata"
      find $FILES_DIR -name '.DS_Store' -exec rm -fv {} \;
      echo "Searching for ._* Macintosh metadata"
      find $FILES_DIR -name '._*' -exec rm -fv {} \;
      echo "Searching for *~ vi backups"
      find $FILES_DIR -name '*~' -exec rm -fv {} \;
      echo "Removing Mac TemporaryItems"
      if [ -e $FILES_DIR/.TemporaryItems/ ]; then
        rm -rfv $FILES_DIR/.TemporaryItems/
      fi

      ID_FILES=`cat $FILES_DIR/.identity`
      ID_BACKUP_FILES=`cat $BACKUP_FILES_DIR/.identity`
      echo ""
      echo "Starting rsync backup, from" $ID_FILES "to" $ID_BACKUP_FILES
      rsync $EXCLUDED --delete-after -av "$@" $FILES_DIR $BACKUP_FILES_DIR
      ERROR=$?

      echo ""
      date
      echo "--------------------------------------------------------------------------------"

      rm -f $TMP_FILE
      exit $ERROR
    else
      date

      echo ""
      echo "Drives are not mounted (or no .identity file exists on drive)"
      if [ ! -e $FILES_DIR/.identity ]; then
        echo $FILES_DIR "is not mounted"
      fi
      if [ ! -e $BACKUP_FILES_DIR/.identity ]; then
        echo $BACKUP_FILES_DIR "is not mounted"
      fi

      echo ""
      echo "--------------------------------------------------------------------------------"

      rm -f $TMP_FILE
      exit 3
    fi
  fi
else
  echo "Backup is already running"
  exit 2
fi
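
One weakness of a plain lock file like TMP_FILE is that it is left behind if the script is killed mid-backup, after which every later run reports that a backup is already running until the file is removed by hand. A minimal sketch of one way to soften this in bash, assuming a lock directory instead of a lock file (mkdir either creates the directory or fails, so two runs cannot both get the lock) and a trap that removes it on exit. The names LOCK_DIR, acquire_lock and release_lock are hypothetical and not part of the script above. A trap cannot run after a power cut or kill -9, so a stale lock is still possible, just less likely.

```shell
#!/bin/bash
# Hypothetical lock location; the script above uses TMP_FILE instead.
LOCK_DIR="${LOCK_DIR:-/tmp/backup_files.lock}"

# Remove the lock; harmless if it is already gone.
release_lock() {
  rmdir "$LOCK_DIR" 2>/dev/null || true
}

# Take the lock; returns non-zero if another run already holds it.
acquire_lock() {
  mkdir "$LOCK_DIR" 2>/dev/null || return 1
  # Release the lock whenever the script exits, including on Ctrl-C or kill (TERM).
  trap release_lock EXIT
  trap 'exit 130' INT
  trap 'exit 143' TERM
}

# Usage inside a backup script (not run here):
#   acquire_lock || { echo "Backup is already running"; exit 2; }
#   rsync and the rest of the backup follow here
```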

The problem with using USB drives is the same drive may be mapped to a different device after reboot or removal. The way I decided to keep all my drives in order is to tag them with a file named .identity within the root of each drive, with the name of the drive in the contents of the .identity file. This way upon bootup, I can have a script read the .identity file, and decide where that drive needs to be mounted. If a device doesn't have a .identity file, I consider that drive not present (this is how the backup script above checks the drives). Below is the script (mount_drives.sh) I use on my home server for mounting my master drive and one of the backups (note that I have four backup drives all mounting to the same location, but at any given time only one of them will ever be on the system).

mount_drives.sh
#!/bin/bash
# bash rather than sh: the == comparisons below are not portable to every /bin/sh
DRIVE_1_DIR="/home/nick/files/"
DRIVE_1_ID="files0"
DRIVE_2_DIR="/home/nick/backup_files/"
DRIVE_2_ID="files1"
DRIVE_3_DIR="/home/nick/backup_files/"
DRIVE_3_ID="files2"
DRIVE_4_DIR="/home/nick/backup_files/"
DRIVE_4_ID="files3"
DRIVE_5_DIR="/home/nick/backup_files/"
DRIVE_5_ID="files4"
TEMP_MOUNT_DIR="/mnt/"
DRIVES="/dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1"

mount_drives_start() {
  for DRIVE in $DRIVES; do
    mount -t ext3 $DRIVE $TEMP_MOUNT_DIR
    if [ -e $TEMP_MOUNT_DIR/.identity ]; then
      ID=`cat $TEMP_MOUNT_DIR/.identity`
      echo "The ID for" $DRIVE "is" $ID
      umount $TEMP_MOUNT_DIR
      if [ $ID == $DRIVE_1_ID ]; then
        if [ ! -e $DRIVE_1_DIR/.identity ]; then
          echo "Mounting" $DRIVE "to" $DRIVE_1_DIR
          mount -t ext3 $DRIVE $DRIVE_1_DIR
        else
          echo "Something is already mounted to" $DRIVE_1_DIR
          echo $DRIVE "will not be mounted"
        fi
      elif [ $ID == $DRIVE_2_ID  ]; then
        if [ ! -e $DRIVE_2_DIR/.identity ]; then
          echo "Mounting" $DRIVE "to" $DRIVE_2_DIR
          mount -t ext3 $DRIVE $DRIVE_2_DIR
        else
          echo "Something is already mounted to" $DRIVE_2_DIR
          echo $DRIVE "will not be mounted"
        fi
      elif [ $ID == $DRIVE_3_ID  ]; then
        if [ ! -e $DRIVE_3_DIR/.identity ]; then
          echo "Mounting" $DRIVE "to" $DRIVE_3_DIR
          mount -t ext3 $DRIVE $DRIVE_3_DIR
        else
          echo "Something is already mounted to" $DRIVE_3_DIR
          echo $DRIVE "will not be mounted"
        fi
      elif [ $ID == $DRIVE_4_ID  ]; then
        if [ ! -e $DRIVE_4_DIR/.identity ]; then
          echo "Mounting" $DRIVE "to" $DRIVE_4_DIR
          mount -t ext3 $DRIVE $DRIVE_4_DIR
        else
          echo "Something is already mounted to" $DRIVE_4_DIR
          echo $DRIVE "will not be mounted"
        fi
      elif [ $ID == $DRIVE_5_ID  ]; then
        if [ ! -e $DRIVE_5_DIR/.identity ]; then
          echo "Mounting" $DRIVE "to" $DRIVE_5_DIR
          mount -t ext3 $DRIVE $DRIVE_5_DIR
        else
          echo "Something is already mounted to" $DRIVE_5_DIR
          echo $DRIVE "will not be mounted"
        fi
      else
        echo "The .identity file does not match any known drive"
        echo $DRIVE "will not be mounted"
      fi
    else
      umount $TEMP_MOUNT_DIR
      echo $DRIVE "does not exist, or does not have a .identity file"
    fi
  done
}

mount_drives_stop() {
  if [ -e $DRIVE_1_DIR/.identity ]; then
    echo "Unmounting" $DRIVE_1_DIR
    umount $DRIVE_1_DIR
  else
    echo "No drive mounted at" $DRIVE_1_DIR", or no .identity file is on drive"
  fi

  if [ -e $DRIVE_2_DIR/.identity ]; then
    echo "Unmounting" $DRIVE_2_DIR
    umount $DRIVE_2_DIR
  else
    echo "No drive mounted at" $DRIVE_2_DIR", or no .identity file is on drive"
  fi

  if [ -e $DRIVE_3_DIR/.identity ]; then
    echo "Unmounting" $DRIVE_3_DIR
    umount $DRIVE_3_DIR
  else
    echo "No drive mounted at" $DRIVE_3_DIR", or no .identity file is on drive"
  fi

  if [ -e $DRIVE_4_DIR/.identity ]; then
    echo "Unmounting" $DRIVE_4_DIR
    umount $DRIVE_4_DIR
  else
    echo "No drive mounted at" $DRIVE_4_DIR", or no .identity file is on drive"
  fi

  if [ -e $DRIVE_5_DIR/.identity ]; then
    echo "Unmounting" $DRIVE_5_DIR
    umount $DRIVE_5_DIR
  else
    echo "No drive mounted at" $DRIVE_5_DIR", or no .identity file is on drive"
  fi
}

mount_drives_restart() {
  mount_drives_stop
  sleep 2
  mount_drives_start
}

case "$1" in
'start')
  mount_drives_start
  ;;
'stop')
  mount_drives_stop
  ;;
'restart')
  mount_drives_restart
  ;;
*)
  mount_drives_start
esac
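
Tagging a new drive only takes writing its name into .identity at the root of the mounted drive, for example echo files4 > /mnt/.identity while the drive is mounted at /mnt/. The long if/elif chain in the script can also be collapsed into a lookup. A minimal sketch, assuming the same IDs and mount points as in mount_drives.sh; the function name mount_point_for_id is hypothetical.

```shell
#!/bin/sh
# Map a drive ID (the contents of .identity) to its mount point.
# Prints the mount point and returns 0, or returns 1 for an unknown ID.
mount_point_for_id() {
  case "$1" in
    files0) echo "/home/nick/files/" ;;
    files1|files2|files3|files4) echo "/home/nick/backup_files/" ;;
    *) return 1 ;;
  esac
}

# Usage inside the loop over $DRIVES (not run here):
#   ID=`cat $TEMP_MOUNT_DIR/.identity`
#   umount $TEMP_MOUNT_DIR
#   if DIR=`mount_point_for_id "$ID"` && [ ! -e "$DIR/.identity" ]; then
#     mount -t ext3 $DRIVE "$DIR"
#   fi
```

Adding a fifth backup drive then means adding one name to the case pattern rather than another block of near-identical lines.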

Below is the script I use for backing up my files to a remote computer (at work).

backup_files_remotely.sh
#!/bin/bash
# bash rather than sh: the script uses [[ ]]
# To backup multiple source dirs into the backup dir, separate dirs with a space and do not end dir paths with a slash
# To copy the contents of the source dir into the backup dir, end with a slash
USERNAME="nick"
SSH_KEY="/home/nick/.ssh/rsa_key"
SOURCE_DIR="/home/nick/files/"
BACKUP_DIR="remote.server_address.net:files/"
EXCLUDE="lost+found .identity"
LOG_FILE="/home/nick/logs/backup_files_remotely_log.txt"
KEEP_LOG="2" # set to 0 to disable, 1 to keep a running log, 2 to delete the log and record only current session
TMP_FILE="/home/nick/.backup_files_remotely_running"

EXCLUDED=""
for i in $EXCLUDE; do
  EXCLUDED="$EXCLUDED --exclude=$i";
done

if [ ! -e $TMP_FILE ]; then
  touch $TMP_FILE
  if [[ $KEEP_LOG -eq 1 || $KEEP_LOG -eq 2 ]]; then
    if [ $KEEP_LOG -eq 2 ]; then
      rm -f $LOG_FILE
    fi
    date | tee -a $LOG_FILE

    echo "" | tee -a $LOG_FILE
    rsync -e "ssh -i $SSH_KEY" $EXCLUDED --delete-after -av "$@" $SOURCE_DIR $USERNAME@$BACKUP_DIR | tee -a $LOG_FILE
    ERROR=${PIPESTATUS[0]} # exit status of rsync (bash only); $? here would be the status of tee

    echo "" | tee -a $LOG_FILE
    date | tee -a $LOG_FILE
    echo "--------------------------------------------------------------------------------" | tee -a $LOG_FILE
  else
    date

    echo ""
    rsync -e "ssh -i $SSH_KEY" $EXCLUDED --delete-after -av "$@" $SOURCE_DIR $USERNAME@$BACKUP_DIR
    ERROR=$?
    date

    echo ""
    echo "--------------------------------------------------------------------------------"
  fi
  rm -f $TMP_FILE
  exit $ERROR
else
  echo "Backup is already running"
  exit 2
fi
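
Both backup scripts spell out every command twice, once with tee and once without, and piping through tee has a catch: $? after a pipeline is the exit status of the last command (tee), not of rsync. A minimal sketch of a helper that deals with both in bash, using the PIPESTATUS array; the name run_logged is hypothetical, and KEEP_LOG and LOG_FILE are meant as in the scripts above (the default log path here is just a placeholder).

```shell
#!/bin/bash
KEEP_LOG="${KEEP_LOG:-1}"
LOG_FILE="${LOG_FILE:-/tmp/backup_example_log.txt}"

# Run a command, copying its output to $LOG_FILE unless KEEP_LOG is 0.
# Returns the exit status of the command itself, never that of tee.
run_logged() {
  if [ "$KEEP_LOG" -eq 0 ]; then
    "$@"
  else
    "$@" | tee -a "$LOG_FILE"
    return "${PIPESTATUS[0]}"
  fi
}

# Usage (not run here):
#   run_logged date
#   run_logged rsync -e "ssh -i $SSH_KEY" $EXCLUDED --delete-after -av $SOURCE_DIR $USERNAME@$BACKUP_DIR
#   ERROR=$?
```

With a helper like this the logging and non-logging branches become the same code, and the value passed to exit reflects whether rsync actually succeeded.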

While rsync generally works well for keeping everything backed up, by default the only things it compares when deciding whether a file needs to be copied are the modification time and the file size. For each file it does transfer, rsync verifies a whole-file checksum to make sure the file was copied correctly (a 128-bit MD4 checksum in older versions, MD5 since rsync 3.0, and newer releases may negotiate other algorithms). Even so, I have noticed instances where this seemed to fail, although it could be that the data in one of the files mutated sometime between the rsync transfer and when I compared the files later on. As a rough estimate, in my own experience about one file differs for every 500GB copied.

If a program modifies the contents of a file but keeps the modification time the same, and the file size happens to stay the same, rsync will not notice any difference between the original and the updated file. To get around this problem, we can force rsync to compare files by size and checksum with the -c switch. With this option, a checksum is generated for every file on the sending side, and on the receiving side only for files whose size matches. This results in a much slower backup, but it catches some of the holes in the typical, much faster method. So every few weeks, or before I swap backup drives, I run the backup script with the -c switch added to the rsync options. One could also schedule a cron job to run the script with the -c switch every month. As long as the checksumming run starts before the usual one, the nightly backup will find the lock file (TMP_FILE), report that a backup is already running, and exit without interfering.
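
The difference between the default quick check and -c is easy to reproduce. A minimal sketch, assuming rsync is installed and working only in a throwaway directory from mktemp; the function name demo_quick_check is made up. It changes a file's contents while keeping its size and modification time, then prints the backup copy after a plain rsync -a (which should still show the old contents) and after rsync -ac (which should show the new ones).

```shell
#!/bin/bash
# Show that rsync's default size+mtime check can miss a change that -c catches.
demo_quick_check() {
  local work
  work=$(mktemp -d) || return 1
  mkdir "$work/src" "$work/dst"
  echo "AAAA" > "$work/src/a"
  rsync -a "$work/src/" "$work/dst/"           # initial backup

  echo "BBBB" > "$work/src/a"                  # same size, new contents
  touch -r "$work/dst/a" "$work/src/a"         # put the old modification time back

  rsync -a "$work/src/" "$work/dst/"           # quick check: size and time match, file is skipped
  echo "after rsync -a:  $(cat "$work/dst/a")"

  rsync -ac "$work/src/" "$work/dst/"          # -c compares checksums instead
  echo "after rsync -ac: $(cat "$work/dst/a")"
  rm -rf "$work"
}
```

Because the backup scripts hand extra arguments straight to rsync, the checksumming run is simply backup_files.sh -c.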

Another option is to compare the files directly with diff. Twice a month a cron job runs a script containing the command below. If a file differs, the output tells me so; I then compare both copies against another backup (like my backup at work, using md5sum if the files are large), recopy the good file over the altered copy, and re-check that file with diff to make sure it took.

diff -rq /home/nick/files/ /home/nick/backup_files/ | tee -a /home/nick/backup_diff.txt
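
The one-line command can be wrapped so that each run is timestamped and the result is available as an exit status. A minimal sketch, relying only on standard diff behaviour (exit status 0 when the trees match, 1 when something differs, 2 on trouble); the function name compare_trees and the report wording are hypothetical.

```shell
#!/bin/bash
# Compare two directory trees and append the result to a report file.
# Returns diff's exit status: 0 = identical, 1 = differences, 2 = trouble.
compare_trees() {
  local master="$1" backup="$2" report="$3" status
  date >> "$report"
  diff -rq "$master" "$backup" >> "$report" 2>&1
  status=$?
  if [ "$status" -eq 0 ]; then
    echo "No differences found" >> "$report"
  fi
  return "$status"
}

# Usage (not run here):
#   compare_trees /home/nick/files/ /home/nick/backup_files/ /home/nick/backup_diff.txt
```

Note that the .identity files differ between drives by design, so a comparison of two whole drives always reports at least that one file. With GNU diff, adding -x .identity -x lost+found to the diff options skips those entries.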

Lastly we come to the computers I actually do work on. While I am actively working on files, I'll keep them in a backup directory on my desktop, which synchronizes itself to a directory on my server every 15 minutes. It is the same script as backup_files_remotely.sh, but set up to run on the computer I am using, with the private key on the local computer, and the public key in the server's $HOME/.ssh/authorized_keys.
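
Put together, the schedule described in this article could be expressed as crontab entries along these lines. This is a sketch only: the times of day and the /home/nick/bin/ script locations, including the hypothetical compare_backups.sh and sync_work_files.sh names, are assumptions, since only the intervals are given above. The snippet prints the entries so they can be reviewed before being installed.

```shell
#!/bin/sh
# Print example crontab entries for the home server (times and paths are assumptions).
# The remote backup is chained with ';' so it starts once the local one has finished.
server_cron_lines() {
  cat <<'EOF'
# min hour day-of-month month day-of-week command
30 2 * * * /home/nick/bin/backup_files.sh; /home/nick/bin/backup_files_remotely.sh
0 5 1,15 * * /home/nick/bin/compare_backups.sh
EOF
}

# Print an example entry for a desktop that syncs its work directory every 15 minutes.
desktop_cron_lines() {
  cat <<'EOF'
*/15 * * * * /home/nick/bin/sync_work_files.sh
EOF
}

# To install: review the output, then paste it into the editor opened by "crontab -e".
```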

So, for the most part I have a fully automated backup system. The only things I need to do are check the logs once in a while to make sure nothing broke, and swap out my backup drives.

