When writing a script that processes files from the Internet, it's a good idea to take advantage of the If-Modified-Since header, because:
- You don't waste bandwidth re-downloading content that may be exactly the same, and
- You don't need to re-process anything if you know the content is unchanged.
1 The flow
cURL has an option for this, -z <date expression>, where the date expression can be either a date in one of the formats listed in the curl_getdate(3) manpage or the filename of an existing file. Using it together with -o FILE is easy:
~ $ curl http://example.com -z index.html -o index.html --verbose --silent --location
Warning: Illegal date format for -z/--timecond (and not a file name).
Warning: Disabling time condition. See curl_getdate(3) for valid date syntax.
* About to connect() to example.com port 80 (#0)
* Trying 192.0.43.10... connected
* Connected to example.com (192.0.43.10) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.21.4 (x86_64-pc-linux-gnu) libcurl/7.21.4 NSS/3.13.1.0 zlib/1.2.5 libssh2/1.3.0
> Host: example.com
> Accept: */*
>
* HTTP 1.0, assume close after body
< HTTP/1.0 302 Found
< Location: http://www.iana.org/domains/example/
< Server: BigIP
* HTTP/1.0 connection set to keep alive!
< Connection: Keep-Alive
< Content-Length: 0
<
* Connection #0 to host example.com left intact
* Issue another request to this URL: 'http://www.iana.org/domains/example/'
* About to connect() to www.iana.org port 80 (#1)
* Trying 192.0.32.8... connected
* Connected to www.iana.org (192.0.32.8) port 80 (#1)
> GET /domains/example/ HTTP/1.0
> User-Agent: curl/7.21.4 (x86_64-pc-linux-gnu) libcurl/7.21.4 NSS/3.13.1.0 zlib/1.2.5 libssh2/1.3.0
> Host: www.iana.org
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Fri, 23 Mar 2012 11:31:14 GMT
< Server: Apache/2.2.3 (CentOS)
< Last-Modified: Wed, 09 Feb 2011 17:13:15 GMT
< Vary: Accept-Encoding
< Connection: close
< Content-Type: text/html; charset=UTF-8
<
{ [data not shown]
* Closing connection #1
* Closing connection #0
There is a warning about the date syntax; it is safe to ignore, since the file index.html has not been created yet. On the second run:
~ $ curl http://example.com -z index.html -o index.html --verbose --silent --location
* About to connect() to example.com port 80 (#0)
* Trying 192.0.43.10... connected
* Connected to example.com (192.0.43.10) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.21.4 (x86_64-pc-linux-gnu) libcurl/7.21.4 NSS/3.13.1.0 zlib/1.2.5 libssh2/1.3.0
> Host: example.com
> Accept: */*
> If-Modified-Since: Fri, 23 Mar 2012 11:33:24 GMT
>
* HTTP 1.0, assume close after body
< HTTP/1.0 302 Found
< Location: http://www.iana.org/domains/example/
< Server: BigIP
* HTTP/1.0 connection set to keep alive!
< Connection: Keep-Alive
< Content-Length: 0
<
* Connection #0 to host example.com left intact
* Issue another request to this URL: 'http://www.iana.org/domains/example/'
* About to connect() to www.iana.org port 80 (#1)
* Trying 192.0.32.8... connected
* Connected to www.iana.org (192.0.32.8) port 80 (#1)
> GET /domains/example/ HTTP/1.0
> User-Agent: curl/7.21.4 (x86_64-pc-linux-gnu) libcurl/7.21.4 NSS/3.13.1.0 zlib/1.2.5 libssh2/1.3.0
> Host: www.iana.org
> Accept: */*
> If-Modified-Since: Fri, 23 Mar 2012 11:33:24 GMT
>
< HTTP/1.1 304 NOT MODIFIED
< Date: Fri, 23 Mar 2012 11:33:54 GMT
< Server: Apache/2.2.3 (CentOS)
< Connection: close
<
* Closing connection #1
* Closing connection #0
As you can see, the server returned 304, no content was transferred, and index.html was left untouched.
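For HTTP, -z with a filename simply turns the file's modification time into an If-Modified-Since request header, as the traces above show. A minimal sketch of sending the same condition by hand with -H, using the date value from the example trace purely as an illustration:

~ $ curl http://example.com -o index.html --silent --location -H 'If-Modified-Since: Fri, 23 Mar 2012 11:33:24 GMT'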
2 When to process
Bandwidth is saved, as demonstrated in the previous section; the next question is how our script knows when to process the file. The key is to get the HTTP response code from the server. cURL provides a set of output variables that can be printed with --write-out, or simply -w. Here is the command we need:
~ $ curl http://example.com -z index.html -o index.html --silent --location --write-out %{http_code}
304
To utilize this, here is a complete example:
if [[ "$(curl http://example.com -z index.html -o index.html -s -L -w %{http_code})" == "200" ]]; then # code here to process index.html because 200 means it gets updated blah blah blah fi
I also shortened the command with single-letter option names. When the response code is 200, the file has been updated and your script needs to process it. The example code does not deal with errors; you may want to save the response code to a variable and check it with a case statement, responding accordingly.
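For instance, a minimal sketch of such error handling might look like this (the URL and the processing steps are placeholders):

code="$(curl http://example.com -z index.html -o index.html -s -L -w '%{http_code}')"
case "$code" in
  200)
    # updated: process index.html here
    ;;
  304)
    # not modified: nothing to do
    ;;
  *)
    # anything else: connection failures show up as 000, plus 4xx/5xx errors
    echo "unexpected response code: $code" >&2
    ;;
esac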
3 Without a saved file
The file will not always be kept on disk; you may process it and then discard it. If so, you need to save the timestamp for later use. A possible workflow may look like this:
do_process () {
  # process here
  # blah blah blah

  # save the timestamp for the next run
  stat -c %Y index.html > index.html.timestamp
  rm index.html
}

if [[ -f index.html.timestamp ]]; then
  # not the first run
  [[ "$(curl http://example.com -z "$(date --rfc-2822 -d @$(<index.html.timestamp))" -o index.html -s -L -w %{http_code})" == "200" ]] && do_process
else
  # first run
  curl http://example.com -o index.html -s -L
  do_process
fi
First, check whether the timestamp file exists; if not, run cURL directly and call the process function. The process function saves the timestamp of index.html after the file has been processed successfully; that timestamp will be used for the next call of cURL.
When the timestamp file exists, we use date to convert it to the RFC 2822 format, which cURL can understand. The $(<...) is equivalent to $(cat ...), and date accepts a date in Unix time by prefixing it with @.
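For example, assuming the stored Unix timestamp is 1297271595 (an arbitrary illustrative value), the conversion looks like this:

~ $ echo 1297271595 > index.html.timestamp
~ $ TZ=UTC date --rfc-2822 -d @"$(<index.html.timestamp)"
Wed, 09 Feb 2011 17:13:15 +0000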
4 Conclusion
You can do this without saving a file even temporarily, by piping the content to standard output. But then you need to parse the response headers for the timestamp, and since the headers are mixed in with the content, you have to separate those as well. Saving to a temporary file is much easier.
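That said, if you do want to avoid keeping a copy of the content, one way to sidestep parsing headers out of the body is to let cURL dump the headers to a separate file with -D/--dump-header while the body goes to standard output. A rough sketch, where do_process_from_stdin stands in for whatever processing pipeline you have:

# headers go to headers.txt, the body goes to stdout for processing
curl -s -L -D headers.txt http://example.com | do_process_from_stdin

# keep the Last-Modified value of the final response for the next run;
# strip the trailing CR of the header line and take the last match in case
# a redirect response also carried the header
grep -i '^Last-Modified:' headers.txt | tail -n 1 | cut -d' ' -f2- | tr -d '\r' > index.html.lastmod

# next run: -z accepts that date string directly
curl -s -L -z "$(<index.html.lastmod)" -D headers.txt http://example.com | do_process_from_stdin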
Another way to check is to compare timestamps instead of the response code; that is really up to your liking.
Dealing with timestamps to avoid wasting bandwidth and processing time is not very hard with cURL. It will be nice for your script to take care of this once it gets mature enough.
I am also writing a similar post for Wget, which may be a little tricky.
Why not just use the last modified time of the file itself? It has a flag AFAICS, it's called -r; then you do not need to store a timestamp. It's already in the file itself.
First of all, I am not sure which -r you are talking about; I checked ls, stat, curl, even wget, and none of them seemed to have the one you mean. But this sounds like a good idea anyway. There is a -R, as in --remote-time. However, from the description, it does not seem to enable downloading only when the file gets updated:
"When used, this will make curl attempt to figure out the timestamp of the remote file, and if that is available make the local file get that same timestamp."
So, it really doesn't help.
Besides, if you choose not to keep the file, then even though curl has an option to check the timestamp and download only if updated, you don't have the file to check against, so you need to keep the timestamp elsewhere.
You're right, I cannot seem to find it now either. However, there is another flag which should work, the -z time cond:
"
TIME CONDITIONS
HTTP allows a client to specify a time condition for the document it
requests. It is If-Modified-Since or If-Unmodified-Since. Curl allows you to
specify them with the -z/--time-cond flag.
For example, you can easily make a download that only gets performed if the
remote file is newer than a local copy. It would be made like:
curl -z local.html http://remote.server.com/remote.html
"
...It just doesn't work. I get errors when I try it.
The first line of The Flow section is actually talking about this -z.
There will be warnings when it is first used; those may be the errors you are talking about. You need to post the messages so I can understand what you were seeing. Anyway, if you use the command and options I used in the post, -z does work.
I see that now. I will try that, thanks.
I tried it, but it doesn't seem to respond with 304. Otherwise it seems to work. As for compression, it works very well; on my system it sends deflate, gzip, which is very good.
ReplyDeletecurl http://www.example.com -z index.html -o index.html --verbose --compress
I will dig into it later, because I will need to inspect my server logs. I still think it shouldn't be necessary to store a timestamp as well if you have already stored the file itself. But maybe I haven't understood everything here.
But thanks for your answers so far, they have been very helpful to me.
Did you get status code 200?
If you read my post, you would see that the first time you get 200, and then 304, as it is supposed to be according to the HTTP specification.
Read my post carefully and see what the process really is. Don't just run commands while you are still not sure about the whole thing.
You really need to give more complete information; post complete commands and outputs to a pastebin or something if you actually want my help, otherwise I will just be wildly guessing at what is going on.
Hi all.
I have got an issue with the curl -z command. Does anyone know which date format I should use? I got this error:
Warning: Illegal date format for -z, --timecond (and not a file name).
Warning: Disabling time condition. See curl_getdate(3) for valid date syntax.
My script:
#!/bin/bash/
today=$(date +%Y%m%d%H%M)
curl -u admin:admin -z "$today" --data "delay=3&force=false&target=/bkps/inc-$today.zip" \
http://localhost:4502/libs/granite/backup/content/admin/backups/