This is a tiny tip, and probably any Linux guru worth his/her salt knows it already, but I just discovered the wget option that checks the timestamp / Last-Modified header prior to downloading a file.  Which is pretty cool if you’ve ever set up any shell scripts that fetch/sync something.

I have written some apps in the past that relied on wget to fetch content and cache it locally (as a backup in case of remote failure, which has bitten me a couple of times).  It also reduces the load if that data is being shown on your website/app.  So if 100 users sign on and check something, it doesn’t hit the remote server for 100x fetches of that data; it just falls back to the local copy, and the re-sync takes place 10-15 min later.

Anyway, the command to do a timestamp check before downloading a file is (using an example URL here):

wget -N http://www.example.com/robots.txt

So the above command will only fetch the robots.txt file if one of the following is true:

  • A file of that name does not already exist locally.
  • A file of that name does exist, but the remote file was modified more recently than the local file.
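The decision wget -N makes can be sketched as a plain shell comparison.  This is just an illustration, not wget’s actual code: the two local files (hypothetical names) stand in for the local and remote copies, and the `-nt` (newer-than) test plays the role of the Last-Modified comparison.

```shell
#!/bin/sh
# Create two files with different mtimes to simulate the scenario:
# an older "local" cached copy and a newer "remote" file.
touch -t 202401010000 local.txt    # older "local" copy
touch -t 202406010000 remote.txt   # newer "remote" copy

# wget -N fetches when no local copy exists, or when the remote
# copy was modified more recently than the local one.
if [ ! -e local.txt ] || [ remote.txt -nt local.txt ]; then
    action=fetch
else
    action=skip
fi
echo "$action"
```

Run as-is, this prints `fetch`, since the simulated remote file is newer.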

Well, there you have it: a dumb but useful option if you ever need it.  Here is a script I’ve used in the past to spool and fetch RSS/XML feeds:

#!/bin/sh
# This script runs via crontab and fetches data from the
# urls.txt file so it can be served internally.  This way we minimize
# the number of external requests for data.
# - created by Jakub

sourcefile="urls.txt"        # one URL per line
storedir="/var/spool/feeds"  # local cache directory; adjust as needed

# Read urls.txt and fetch each URL.  -q is quiet, -N only downloads
# when the remote file is newer than the local copy.  Note that -N
# cannot be combined with -O, so -P sets the target directory and the
# saved filename is taken from the URL itself.
for s in `cat "$sourcefile"`; do
    wget -qN -P "$storedir" "$s"
done
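To get the 10-15 minute re-sync mentioned above, a crontab entry along these lines would schedule the script (the path and filename here are hypothetical; use wherever you saved it):

```shell
# m    h    dom  mon  dow   command
*/15   *    *    *    *     /usr/local/bin/fetch-feeds.sh
```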