If you want to make an offline copy/mirror of a website using the GNU/Linux wget command, a command like this will do the trick for you:

wget --mirror \
     --convert-links \
     --html-extension \
     --wait=2 \
     -o log \
     http://howisoldmybusiness.com
This will create a folder named howisoldmybusiness.com containing the entire site, suitable for offline viewing with a web browser. It is a static copy, meaning forms and some JavaScript used for interactivity will likely not work.
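If you'd rather browse the copy over HTTP instead of opening the files directly, one simple approach — assuming you have Python 3 installed; any local web server will do — is to serve the folder with Python's built-in web server:

cd howisoldmybusiness.com
python3 -m http.server 8000
# then point your browser at http://localhost:8000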
One thing I learned about this command is that it doesn’t make a copy of “rollover” images, i.e., images that are changed by JavaScript when the user rolls over them. I haven’t investigated how to fix this yet, but the easiest thing to do is to copy the /images directory from the server, assuming that you’re making a static copy of your own website, as I am doing. Another thing you can do is manually download the rollover images.
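For example, if you have shell access to the server, something along these lines will copy the images down; the username, hostname, and paths here are just placeholders for whatever your setup uses:

scp -r you@yourserver.com:/var/www/howisoldmybusiness.com/images howisoldmybusiness.com/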
Note that you won't always need to use the --html-extension option with wget, but because the original version of my How I Sold My Business website did not use any extensions at the end of the URLs, it was necessary in this case.
What I mean by that is that the original version of my website had URLs like this:
http://howisoldmybusiness.com/content/friday-october-18-2002
Notice that there is no .html extension at the end of that URL. Therefore, if you use wget without the --html-extension option, you end up with a file on your local computer with this name:
content/friday-october-18-2002
Even if you use MAMP or WAMP to serve this file from your local filesystem, the server isn't going to know that this is an HTML file, so essentially what you end up with is a worthless file.
Conversely, when you do use the --html-extension option, you end up with this file on your local filesystem:
content/friday-october-18-2002.html
On a Mac, that file is easily opened in a browser, and you don't even need MAMP. wget is also smart enough to change all the links within the offline version of the website to refer to the new filenames, so everything works.
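For example, from the top of the mirrored howisoldmybusiness.com folder on a Mac, you can open that page in your default browser straight from the command line:

open content/friday-october-18-2002.html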
Here's a short explanation of the options I used in that wget command:

--mirror
    Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth, and keeps FTP directory listings. It is currently equivalent to '-r -N -l inf --no-remove-listing'.

--convert-links
    After the download is complete, convert the links in the documents to make them suitable for local viewing.

--html-extension
    Append the .html suffix to downloaded HTML pages whose URLs don't already have an .html or .htm extension (as discussed above).

-o log
    Write wget's log output to a file named "log" instead of to the terminal.

--wait=seconds
    Wait the specified number of seconds between retrievals. Use of this option is recommended, as it lightens the server load by making the requests less frequent.
Depending on the web server settings of the website you're copying, you may also need to use the -U option, which works something like this:

-U Mozilla
    Masquerade as a Mozilla browser.
That option lets you set the wget user agent. (I suspect that the string you use may need to be a little more complicated than that, but I didn't need it, and didn't investigate it further.)
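If a plain "Mozilla" isn't enough, -U also accepts a full user-agent string; the string below is only an illustrative example, not something I tested against my site:

wget --mirror \
     --convert-links \
     --html-extension \
     --wait=2 \
     -o log \
     -U "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) AppleWebKit/537.36" \
     http://howisoldmybusiness.com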
I got most of these settings from the GNU wget manual.