an easy website content mirror

Posted by: Dan on February 11, 2008 10:06:52 PM +00:00
grab a paper clip, find, wget and cron to set up a simple image mirror


Problem

For one reason or another, you want to split up your HTTP requests based on file type. In this example, we're mirroring image and Flash files. Also, for one reason or another, you don't feel like going through the process of installing any new software. There are plenty of more robust solutions out there:

  • rsync
  • scp
  • nfs
  • mirror

Solution

On the main server, we'll need to set up a simple shell script that will find all of the image files modified in the last 72 minutes (a bit more than the hourly cron interval, so files can't slip through the gap between runs):

#!/bin/sh
# find all the images modified in the last 72 minutes and write the list to a file
# (-E enables extended regexes on BSD find; GNU find uses -regextype posix-extended instead)
cd /path/to/web-root/
find -E img/ -mmin -72 -iregex '(.*\.png$|.*\.jpg$|.*\.gif$|.*\.swf$)' > img/grabme.txt

Save this to a convenient spot in your home directory, like ~/mirror-find.sh.

This simple script finds any files under img/ modified in the last 72 minutes that end in .png, .jpg, .gif, or .swf, and writes that list to a text file so we can process it in a moment on the server doing the mirroring.
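
The list comes out as plain relative paths, one per line. With some hypothetical filenames, img/grabme.txt might look like this:

img/logo.png
img/photos/header.jpg
img/banner.swf

Those relative paths matter later: they're exactly what the -B option on the mirror side resolves against.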

Add this to your user's crontab (crontab -e) so it runs at the end of every hour, at minute 59:

59 * * * * /bin/sh /home/user/mirror-find.sh 

On the mirror server, we'll need another simple shell script and cron job:

#!/bin/sh
# fetch the latest list of changed files, then fetch the files themselves
cd /home/user/mirror-root/
# -r makes wget overwrite the old list (and file it under example.tld/img/)
/usr/bin/wget -r http://example.tld/img/grabme.txt
# pull every file on the list, two seconds apart
/usr/bin/wget -rq -w 2 -B http://example.tld -i example.tld/img/grabme.txt

Save this to a convenient spot in your home directory, like ~/mirror-batch.sh.

This script grabs the latest list of files, then feeds it to a second wget command that retrieves the new and modified files. Note the wget options: -r puts wget in recursive mode, which here has two handy side effects, saving downloads under a hostname directory (example.tld/img/...) and overwriting existing files rather than writing numbered duplicates; -q does the work quietly; -w 2 waits two seconds between requests to go easy on the main server; and -i uses grabme.txt as the list of files to retrieve. The -B option nicely prepends the proper URL scheme and domain to each relative path on the list.
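
Before handing this to cron, it's worth a quick sanity check: run the script once by hand and look for the mirrored tree (paths as assumed above):

cd /home/user/mirror-root/
sh ~/mirror-batch.sh
ls example.tld/img/

If the listing shows your images and a fresh grabme.txt, the cron job will do the rest.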

For this one, we'll set up the cron job to run a couple minutes after the hour:

3 * * * * /bin/sh /home/user/mirror-batch.sh

Catching New Files

Now, most of the time in my example, files are merely modified, and the worst that can happen in the gap between cron runs is that some visitors see the older file. New files are another story: since we're redirecting requests for the whole /img/ directory from the main web server to the image mirror, a file added in the last hour would generate a 404 and a broken page (missing image), so we need a way to catch those.

Most web servers will let you specify an error handler, so you can do something special when an HTTP error occurs. In our case, it's the 404 Not Found error we want to avoid. We can handle it by redirecting any image file that has yet to be copied to the mirror back to the main server (reachable at an alternative URL).

404.php

<?php
// bounce the request back to the main server, reachable here at the
// alternative hostname media.example.tld, for any file the mirror
// doesn't have yet; header('Location: ...') sends a 302 by default
$newloc = 'http://media.example.tld' . $_SERVER["REQUEST_URI"];
header('Location: ' . $newloc);
?>
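
How you wire up the handler depends on your server. On Apache, for example, a one-line ErrorDocument directive in the site config or an .htaccess file points 404s at the script:

ErrorDocument 404 /404.php

nginx has an equivalent in its error_page directive.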

Why do it this way?

This solution appeals to me because it is very simple and easy to adjust. Additionally, we didn't have to create any new users or services to accomplish it. For me, it solved a very specific problem where one particular server was maxing out its monthly bandwidth allotment.

Updated: 05 Apr 10 06:34