You may want to read a similar post I wrote for cURL, about my reasons for avoiding re-downloading unchanged content.
1 The flow
It seems fairly easy, too:
% wget http://example.com -S -N
--2012-03-23 20:27:23--  http://example.com/
Resolving example.com... 192.0.43.10
Connecting to example.com|192.0.43.10|:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.0 302 Found
  Location: http://www.iana.org/domains/example/
  Server: BigIP
  Connection: Keep-Alive
  Content-Length: 0
Location: http://www.iana.org/domains/example/ [following]
--2012-03-23 20:27:23--  http://www.iana.org/domains/example/
Resolving www.iana.org... 192.0.32.8
Connecting to www.iana.org|192.0.32.8|:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Date: Fri, 23 Mar 2012 12:27:24 GMT
  Server: Apache/2.2.3 (CentOS)
  Last-Modified: Wed, 09 Feb 2011 17:13:15 GMT
  Vary: Accept-Encoding
  Connection: close
  Content-Type: text/html; charset=UTF-8
Length: unspecified [text/html]
Server file no newer than local file `index.html' -- not retrieving.
You may have noticed the difference: Wget does not send If-Modified-Since as cURL does. This check is presumably a HEAD request; Wget then decides whether to GET the file based on the Last-Modified and Content-Length headers it receives. That becomes a problem when a server sends Content-Length: 0 even though the actual content is non-empty. Blogger is one such case:
% wget http://oopsbroken.blogspot.com --server-response --timestamping --no-verbose
  HTTP/1.0 200 OK
  X-Robots-Tag: noindex, nofollow
  Content-Type: text/html; charset=UTF-8
  Expires: Fri, 23 Mar 2012 12:38:47 GMT
  Date: Fri, 23 Mar 2012 12:38:47 GMT
  Cache-Control: private, max-age=0
  Last-Modified: Fri, 23 Mar 2012 10:49:23 GMT
  ETag: "f5024c0a-c96f-464f-b96b-d89efdd69010"
  X-Content-Type-Options: nosniff
  X-XSS-Protection: 1; mode=block
  Content-Length: 0
  Server: GSE
  Connection: Keep-Alive
  HTTP/1.0 200 OK
  X-Robots-Tag: noindex, nofollow
  Content-Type: text/html; charset=UTF-8
  Expires: Fri, 23 Mar 2012 12:38:47 GMT
  Date: Fri, 23 Mar 2012 12:38:47 GMT
  Cache-Control: private, max-age=0
  Last-Modified: Fri, 23 Mar 2012 10:49:23 GMT
  ETag: "f5024c0a-c96f-464f-b96b-d89efdd69010"
  X-Content-Type-Options: nosniff
  X-XSS-Protection: 1; mode=block
  Server: GSE
2012-03-23 20:38:48 URL:http://oopsbroken.blogspot.com/ [44596] -> "index.html" [1]
Every time you run this, the file gets re-downloaded even though the content is the same.
Wget's info page says:
A file is considered new if one of these two conditions are met:

1. A file of that name does not already exist locally.
2. A file of that name does exist, but the remote file was modified more recently than the local file.

[snip]

If the local file does not exist, or the sizes of the files do not match, Wget will download the remote file no matter what the time-stamps say.
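The rule above can be paraphrased as shell pseudocode. This is a sketch, not Wget's actual implementation; the sizes mimic the Blogger case below:

```shell
# Paraphrase of Wget's -N decision rule; values mimic the Blogger case.
local_exists=yes
local_size=44596        # bytes on disk
remote_size=0           # bogus Content-Length from the server
local_mtime=1332499763  # local file's mtime (epoch seconds)
remote_mtime=1332499763 # server's Last-Modified (epoch seconds)

if [ "$local_exists" != yes ]; then
  decision="download: no local file"
elif [ "$remote_size" -ne "$local_size" ]; then
  decision="download: sizes differ"      # Blogger always hits this branch
elif [ "$remote_mtime" -gt "$local_mtime" ]; then
  decision="download: remote is newer"
else
  decision="skip: local copy is current"
fi
echo "$decision"
```

Because the size comparison comes before the timestamp comparison, a wrong Content-Length defeats timestamping entirely.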
If the sizes do not match, Wget will GET the file. In Blogger's case, the server returns:
Content-Length: 0
This is incorrect, since the content length is not really zero, but Wget believes it. The actual file length is 44596 bytes; the sizes do not match, so Wget re-downloads the file.
To avoid this, you need the --ignore-length option:
% wget http://oopsbroken.blogspot.com --server-response --timestamping --no-verbose --ignore-length
  HTTP/1.0 200 OK
  X-Robots-Tag: noindex, nofollow
  Content-Type: text/html; charset=UTF-8
  Expires: Fri, 23 Mar 2012 12:42:06 GMT
  Date: Fri, 23 Mar 2012 12:42:06 GMT
  Cache-Control: private, max-age=0
  Last-Modified: Fri, 23 Mar 2012 10:49:23 GMT
  ETag: "f5024c0a-c96f-464f-b96b-d89efdd69010"
  X-Content-Type-Options: nosniff
  X-XSS-Protection: 1; mode=block
  Content-Length: 0
  Server: GSE
Now Wget no longer re-downloads the file just because of the bogus Content-Length.
2 Issues
There are several issues or difficulties when using Wget instead of cURL.
2.1 Incompatible with -O
As you can see, I didn't use -O to specify the output file, because it is incompatible with -N (--timestamping): -O disables -N. So you have to deal with the downloaded filename yourself. Basically you can take the basename of the URL, but the result may be index.html, or some bizarre name if a query string is present.
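One rough way to predict the filename is to derive it from the URL yourself. The fallback to index.html and the query-string stripping below are simplifying assumptions; Wget's actual naming has more corner cases:

```shell
# Derive the local filename wget would likely use -- a rough sketch only.
url=http://oopsbroken.blogspot.com/

case "$url" in
  */) name=index.html ;;          # trailing slash: wget saves index.html
  *)  name=$(basename "$url") ;;  # otherwise: the last path segment
esac
name=${name%%\?*}                 # drop any query string (simplification)
echo "$name"
```

For the Blogger URL above this prints index.html, matching what wget saved.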
2.2 When to process
You cannot rely on the exit status; you have to check whether the file was actually updated, either by watching the file's timestamp or by parsing Wget's output to see whether a file was saved. Neither is a particularly clean way to deal with it.
However, if you save the timestamp yourself, you could use it for the check and would not really need to keep a local file. Well, that is not true either: you still need the local file, because Wget reads the timestamp from it. I cannot find any way to pass a timestamp directly.
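The timestamp-watching approach can be sketched like this. The touch stands in for the real wget call so the sketch runs offline, and stat -c %Y is GNU coreutils (use stat -f %m on BSD/macOS); note that mtimes here only have one-second resolution:

```shell
# Detect whether a fetch replaced the file by comparing mtimes.
# `touch` stands in for the real call: wget -N --ignore-length "$url"
file=$(mktemp -u nstest.XXXXXX)   # nonexistent name for a clean demo

if [ -e "$file" ]; then before=$(stat -c %Y "$file"); else before=0; fi
touch "$file"                     # <- replace with the wget invocation
after=$(stat -c %Y "$file")

if [ "$after" -gt "$before" ]; then
  echo "updated -> run post-processing"
else
  echo "unchanged -> nothing to do"
fi
rm -f "$file"
```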
3 Conclusion
I recommend using cURL instead of Wget. You could manually add request headers to work around these issues, but cURL makes this much easier, so why bother?
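For comparison, here is a sketch of the cURL side. The URL and filename are placeholders, and a real run needs network access: -z/--time-cond sends If-Modified-Since based on the local file's mtime, and --remote-time stamps the saved copy with the server's Last-Modified, so the next run's condition is accurate.

```shell
# Rough cURL equivalent of wget -N -- a comparison sketch, not a drop-in.
url=http://example.com/
file=index.html

# -z "$file" sends If-Modified-Since from $file's mtime (if it exists);
# the server answers 304 Not Modified and curl writes nothing.
curl --silent --remote-time --time-cond "$file" \
     --output "$file" "$url" || echo "fetch failed (offline?)"
```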
If you have a good way to deal with these issues (other than doing the whole process manually in a script), feel free to comment with code. There may be useful options I missed when reading the manpage and info page.