Python HTTP Web Services - urllib, httplib2
A web service is a software system designed to support interoperable machine-to-machine interaction over a network. It has an interface described in a machine-processable format, the Web Services Description Language (WSDL). Other systems interact with the web service in a manner prescribed by its description using SOAP messages, typically conveyed using HTTP with an XML serialization in conjunction with other Web-related standards. - from wiki
We can identify two major classes of Web services, REST-compliant Web services, in which the primary purpose of the service is to manipulate XML representations of Web resources using a uniform set of "stateless" operations; and arbitrary Web services, in which the service may expose an arbitrary set of operations.
Philosophically, we can describe HTTP web services in 12 words: exchanging data with remote servers using nothing but the operations of http.
It's really important to understand the details of the HTTP protocol in order to build and debug effective web services. The HTTP protocol is organized around a client sending a request to a server. A key part of that request is the action the client is asking the server to take on its behalf: every request carries a request method, the verb, which is to be applied to a specific resource on the server.
For example, when we access a webpage with our browser, we typically send a request whose request method is GET, and the resource is usually a webpage such as 'index.html', the core page of a website. In this case, GET is the request method and 'index.html' is the resource.
The resource is typically specified as a path on the server, so we'll see something like '/index.html' or 'foo/mypage' or some other resource that we would like to access.
The http protocol defines a variety of request methods. One of the most important is GET, a simple request to the server, carrying no data or possibly a small amount, that asks the server to fetch some resource and return it to us.
Another important one is POST. POST is typically used when we want to send a larger amount of data to the server. For example, if we want to upload an image so the server can store it and serve it up at some later point in time, POST is what we would use; we wouldn't push an image through GET, which is meant for requests carrying little or no data.
POST is the more general-purpose method for sending data to the server.
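As a minimal sketch of the difference (httpbin.org is a public request-echo service used here purely as an example endpoint; it is not part of this article), urllib.request issues a GET by default and switches to POST as soon as a request body is supplied:

from urllib.parse import urlencode
from urllib.request import urlopen

# GET: no request body, so urllib.request uses the GET method by default.
response = urlopen('http://httpbin.org/get')
print(response.getcode())                      # 200

# POST: supplying a bytes body via data= switches the method to POST.
payload = urlencode({'p': 'python'}).encode('utf-8')
response = urlopen('http://httpbin.org/post', data=payload)
print(response.getcode())                      # 200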
If we want to get data from the server, use http GET.
So, when we type the following url into the browser:
http://www.bogotobogo.com/foo
Internally, we're issuing the following request line, requesting a page from the http server:
GET /foo http/1.1
- GET - method
- /foo - path
- http/1.1 - version
Here is another, slightly more complicated example. Guess what the request line would be:
http://www.bogotobogo.com/foo/bogo.png?p=python#fragment
The answer is:
GET /foo/bogo.png?p=python http/1.1
The host www.bogotobogo.com will be used for the connection, and the fragment stays on the client side.
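We can check this split for ourselves with urllib.parse; a small sketch using the url above:

from urllib.parse import urlsplit

parts = urlsplit('http://www.bogotobogo.com/foo/bogo.png?p=python#fragment')

print(parts.netloc)    # 'www.bogotobogo.com' -> used for the connection (and the Host header)
print(parts.path)      # '/foo/bogo.png'
print(parts.query)     # 'p=python'
print(parts.fragment)  # 'fragment'           -> stays on the client; never sent to the server

# The pieces that end up in the request line:
print('GET {0}?{1} http/1.1'.format(parts.path, parts.query))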
Actually, an http request carries more than just the request line: it also has headers, each of which is a name: value pair:
Host: www.bogotobogo.com
User-Agent: Chrome
Though the connection has already been made, we still need the Host header because a single web server may be hosting several domains.
If we want to send new data to the server, use http POST. Some more advanced http web service APIs also allow creating, modifying, and deleting data, using http PUT and http DELETE. That's it. No registries, no envelopes, no wrappers, no tunneling. The verbs built into the http protocol (GET, POST, PUT, and DELETE) map directly to application-level operations for retrieving, creating, modifying, and deleting data.
The main advantage of this approach is simplicity, and its simplicity has proven popular. Data - usually xml or json - can be built and stored statically, or generated dynamically by a server-side script, and all major programming languages (including Python, of course!) include an http library for downloading it. Debugging is also easier; because each resource in an http web service has a unique address (in the form of a url), we can load it in our web browser and immediately see the raw data. - from http://getpython3.com/diveintopython3/http-web-services.html
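As a hedged illustration of that verb-to-operation mapping, here is how the four methods might be issued with Python's low-level http.client; the host api.example.com and the /resources paths are made up for this sketch, not a real API:

import http.client

# Hypothetical host and paths, used only to illustrate how the verbs map to operations.
conn = http.client.HTTPConnection('api.example.com')

def send(method, path, body=None):
    headers = {'Content-Type': 'application/json'} if body else {}
    conn.request(method, path, body=body, headers=headers)
    response = conn.getresponse()
    response.read()              # drain the body so the connection can be reused
    return response.status

print(send('GET', '/resources/42'))                         # retrieve a resource
print(send('POST', '/resources', '{"name": "new item"}'))   # create a new one
print(send('PUT', '/resources/42', '{"name": "updated"}'))  # modify it
print(send('DELETE', '/resources/42'))                      # delete it

conn.close()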
For the following http request:
GET /foo http/1.1
The server's response should be something like this:
http/1.1 200 OK
This line is called the status line:
- 200: status code
- OK: reason phrase
Here are some of the status codes:
Status | Meaning | Example |
---|---|---|
1xx | Information | 100 = server agrees to handle client's request |
2xx | Success(OK) | 200 = request succeeded, 204 = no content present |
3xx | Redirection | 301 = page moved, 304 = cached page still valid |
4xx | Client error | 403 = forbidden page, 404 = page not found |
5xx | Server error | 500 = internal server error, 503 = try again later |
The header has more components, and we will see them later in this page.
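As a quick sketch of how these codes surface in Python, urllib.request raises an HTTPError for 4xx and 5xx responses, and the exception object carries the status code and reason phrase (the path below is made up, used only to trigger an error status):

from urllib.request import urlopen
from urllib.error import HTTPError

try:
    # A deliberately bogus path, used here only to provoke an error status.
    response = urlopen('http://www.bogotobogo.com/no-such-page.html')
    print(response.getcode(), response.reason)   # 2xx responses come back normally
except HTTPError as e:
    print(e.code, e.reason)                      # e.g. 404 Not Found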
Python 3 comes with two different libraries for interacting with http web services:
- http.client is a low-level library that implements rfc 2616, the http protocol.
- urllib.request is an abstraction layer built on top of http.client. It provides a standard API for accessing both http and ftp servers, automatically follows http redirects, and handles some common forms of http authentication.
So which one should we use? Neither of them. Instead, we should use httplib2, an open source third-party library that implements http more fully than http.client but provides a better abstraction than urllib.request.
There are five important features which all http clients should support.
Network access is incredibly expensive!
The most important thing to realize about any type of web service is that network access is incredibly expensive. It takes an extraordinarily long time to open a connection, send a request, and retrieve a response from a remote server. Even on the fastest broadband connection, latency can still be higher than we anticipate: a router misbehaves, a packet is dropped, an intermediate proxy is under attack, and so on.
http is designed with caching in mind. There is an entire class of devices called caching proxies whose only job is to sit between us and the rest of the world and minimize network access. Our company or ISP almost certainly maintains caching proxies, even if we're not aware of them. They work because caching is built into the http protocol.
In Chrome, after installing the chrome-extension-http-headers extension, we can inspect the http header info.
In Firefox, install the Web Developer add-on and use Page Information->Headers.
Caching speeds up repeated page views and saves a lot of traffic by preventing the re-download of unchanged content on every page view. We can use Cache-Control: max-age= to inform the browser that the component won't change for a defined period. This way we avoid unneeded requests if the browser already has the component in its cache, and primed-cache page views are therefore faster.
After installing the HTTP Response Browser, we can see the response to our request.
The following example shows the response to the request of an image.
We visit http://www.bogotobogo.com/python/images/python_http_web_services/Browsers.png in our browser. When our browser downloads that image, the server includes the following http headers:
Unfortunately, my site does not send Cache-Control/Expires headers. So, let's look at another site:
The Cache-Control and Expires headers tell our browser (and any caching proxies between us and the server) that this image can be cached for up to 2 minutes (from Sun, 20 Jan 2013 22:16:26 GMT to Sun, 20 Jan 2013 22:18:26 GMT). And if, within that period, we visit the page, our browser will load the page from its cache without generating any network activity whatsoever.
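We can also check these freshness headers from Python itself; a small sketch (using http://www.w3.org/, which, as we'll see further down this page, does send caching headers):

from urllib.request import urlopen

# www.w3.org is used as the example because it sends Cache-Control and Expires headers.
response = urlopen('http://www.w3.org/')
for name in ('Cache-Control', 'Expires', 'Last-Modified', 'ETag'):
    print(name, ':', response.headers.get(name))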
But suppose we're trying to download an image that still has about a month before it expires, and our browser purges the image from the local cache for some reason. The http headers said that this data could be cached by public caching proxies. (Technically, the important thing is what the headers don't say: the Cache-Control header doesn't have the private keyword, so this data is cacheable by default.) Caching proxies are designed to have tons of storage space, probably far more than our local browser has allocated.
If our ISP maintains a caching proxy, the proxy may still have the image cached. When we visit the site again, our browser will look in its local cache for the image, but it won't find it, so it will make a network request to try to download it from the remote server. If the caching proxy still has a copy of the image, it will intercept that request and serve the image from its cache. That means our request never reaches the remote server; in fact, it never leaves our ISP's network. That makes for a faster download (fewer network hops).
http caching only works when everybody does their part. On one side, servers need to send the correct headers in their response. On the other side, clients need to understand and respect those headers before they request the same data twice. The proxies in the middle are not a panacea; they can only be as smart as the servers and clients allow them to be.
Python's http libraries do not support caching, but httplib2 does.
Some data never changes, while other data changes all the time. In between, there is a vast field of data that might have changed, but hasn't. gizmodo.com's feed is updated every few minutes, but this site's feed may not change for days or weeks at a time. In the latter case, I don't want to tell clients to cache my feed for weeks at a time, because then when I do actually post something, people may not read it for weeks (because they're respecting my cache headers, which said don't bother checking this feed for weeks). On the other hand, I don't want clients downloading my entire feed once an hour if it hasn't changed!
http has a solution to this, too. When we request data for the first time, the server can send back a Last-Modified header. This is exactly what it sounds like: the date that the data was changed.
When we request the same data a second (or third or fourth) time, we can send an If-Modified-Since header with our request, with the date we got back from the server last time. If the data has changed since then, then the server gives us the new data with a 200 status code. But if the data hasn't changed since then, the server sends back a special http 304 status code, which means this data hasn't changed since the last time we asked for it. We can test this on the command line, using cURL:
$ curl -I -H "If-Modified-Since: Sun, 20 Jan 2013 19:56:58 GMT" http://www.bogotobogo.com/python/images/python_http_web_services/Browsers.png HTTP/1.1 304 Not Modified Date: Sun, 20 Jan 2013 23:21:23 GMT Server: Apache ETag: "664002c-2f09-4d3bdbf48a672" Vary: Accept-Encoding
This is an improvement because when the server sends a 304, it doesn't re-send the data. All we get is the status code. Even after our cached copy has expired, last-modified checking ensures that we won't download the same data twice if it hasn't changed. (Actually, this 304 response also includes caching headers. Proxies will keep a copy of data even after it officially expires, in the hopes that the data hasn't really changed and the next request responds with a 304 status code and updated cache information.)
Python's http libraries do not support last-modified date checking, but httplib2 does.
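We can, however, do the bookkeeping by hand; a rough sketch of the idea with urllib.request (this mirrors the cURL example above, and is not what httplib2 does internally):

from urllib.request import Request, urlopen
from urllib.error import HTTPError

url = 'http://www.bogotobogo.com/python/images/python_http_web_services/Browsers.png'

# First request: remember the Last-Modified value the server sent back.
first = urlopen(url)
last_modified = first.headers.get('Last-Modified')   # assumes the server sends this header

# Second request: hand that value back in an If-Modified-Since header.
try:
    second = urlopen(Request(url, headers={'If-Modified-Since': last_modified}))
    print('Changed; downloaded again:', second.getcode())
except HTTPError as e:
    if e.code == 304:
        print('304 Not Modified; reuse the copy we already have')
    else:
        raise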
An ETag or entity tag, is part of HTTP, the protocol for the World Wide Web. It is one of several mechanisms that HTTP provides for web cache validation, and which allows a client to make conditional requests. This allows caches to be more efficient, and saves bandwidth, as a web server does not need to send a full response if the content has not changed. - from wiki
ETags are an alternate way to accomplish the same thing as the last-modified checking. With ETags, the server sends a hash code in an ETag header along with the data we requested.
200 OK
Date: Mon, 21 Jan 2013 00:07:51 GMT
Content-Encoding: gzip
Last-Modified: Sun, 20 Jan 2013 19:56:58 GMT
Server: Apache
ETag: "664002c-2f09-4d3bdbf48a672"
Vary: Accept-Encoding
Content-Type: image/png
Accept-Ranges: bytes
Content-Length: 12054
The second time we request the same data, we include the ETag hash in an If-None-Match header of our request. If the data hasn't changed, the server will send us back a 304 status code. As with the last-modified date checking, the server sends back only the 304 status code; it doesn't send us the same data a second time. By including the ETag hash in our second request, we're telling the server that there's no need to re-send the same data if it still matches this hash, since we still have the data from the last time.
$ curl -I -H "If-None-Match: \"664002c-2f09-4d3bdbf48a672\"" http://www.bogotobogo.com/python/images/python_http_web_services/Browsers.png HTTP/1.1 304 Not Modified Date: Mon, 21 Jan 2013 00:14:50 GMT Server: Apache ETag: "664002c-2f09-4d3bdbf48a672" Vary: Accept-Encoding
Note that ETags are commonly enclosed in quotation marks, but the quotation marks are part of the value. That means we need to send the quotation marks back to the server in the If-None-Match header.
Python's http libraries do not support ETags, but httplib2 does.
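As with last-modified checking, we can send the validator by hand; a sketch with urllib.request that mirrors the cURL example above:

from urllib.request import Request, urlopen
from urllib.error import HTTPError

url = 'http://www.bogotobogo.com/python/images/python_http_web_services/Browsers.png'

etag = urlopen(url).headers.get('ETag')   # e.g. '"664002c-2f09-4d3bdbf48a672"', quotes included

# Send the ETag back, quotation marks and all, in an If-None-Match header.
try:
    urlopen(Request(url, headers={'If-None-Match': etag}))
    print('Content changed; full response downloaded')
except HTTPError as e:
    if e.code == 304:
        print('304 Not Modified; our copy is still good')
    else:
        raise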
When we talk about http web services, we're almost always talking about moving text-based data back and forth over the wire. It could be xml, json, or it could be just plain text. Regardless of the format, text compresses well. The example feed in the XML chapter is 25K bytes uncompressed, but would be 6K bytes after gzip compression. That's just 25% of the original size!
http supports several compression algorithms. The two most common types are gzip and deflate. When we request a resource over http, we can ask the server to send it in compressed format. We include an Accept-encoding header in our request that lists which compression algorithms we support. If the server supports any of the same algorithms, it will send us back compressed data (with a Content-encoding header that tells us which algorithm it used). Then it's up to us to decompress the data.
Python's http libraries do not support compression, but httplib2 does.
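Doing it by hand with urllib.request would look roughly like the following sketch: we advertise gzip support ourselves, then undo the compression with the gzip module if the server actually used it.

import gzip
from urllib.request import Request, urlopen

request = Request('http://www.w3.org/', headers={'Accept-Encoding': 'gzip'})
response = urlopen(request)
raw = response.read()

if response.headers.get('Content-Encoding') == 'gzip':   # the server chose gzip
    body = gzip.decompress(raw)
else:                                                     # the server sent it uncompressed
    body = raw

print(len(raw), 'bytes on the wire ->', len(body), 'bytes after decompression')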
Web sites keep changing. Web services can be reorganized, and even the domain might move. Every time we request any kind of resource from an http server, the server includes a status code in its response:
- 200: everything's normal
- 404: page not found
- 300: redirection
http has several different ways of signifying that a resource has moved. The two most common techniques are status codes 302 and 301.
- 302: a temporary redirect; it means oops, that got moved over here temporarily, and then gives the temporary address in a Location header. If we get a 302 status code and a new address, the http specification says we should use the new address to get what we asked for, but the next time we want to access the same resource, we should retry the old address.
- 301: a permanent redirect; it means oops, that got moved permanently, and then gives the new address in a Location header. If we get a 301 status code and a new address, we're supposed to use the new address from then on.
The urllib.request module automatically follows redirects when it receives the appropriate status code from the http server, but it doesn't tell us that it did so. We'll end up getting the data we asked for, but we'll never know that the underlying library helpfully followed a redirect for us. So we'll continue pounding away at the old address, and each time we'll get redirected to the new address, and each time the urllib.request module will helpfully follow the redirect. In other words, it treats permanent redirects the same as temporary redirects. That means two round trips instead of one, which is bad for the server and bad for us.
Python's http libraries do not support permanent redirects, but httplib2 does. Not only will it tell us that a permanent redirect occurred, it will keep track of them locally and automatically rewrite redirected urls before requesting them.
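We can observe the urllib.request behaviour ourselves: geturl() on the response reveals the final address, but nothing is remembered between calls. A sketch, using the redirect page set up later in this article:

from urllib.request import urlopen

old_url = 'http://www.bogotobogo.com/python/python_http_web_services_redirect.php'

response = urlopen(old_url)      # urllib silently follows the 301
print(response.geturl())         # the final url it was redirected to

# Nothing is remembered between calls: the next request starts from the old url
# again, costing an extra round trip every single time.
response2 = urlopen(old_url)
print(response2.geturl())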
Suppose we want to download a resource over http, such as an Atom feed. Being a feed, we're not just going to download it once; we're going to download it over and over again. (Most feed readers will check for changes once an hour.) Let's do it the quick-and-dirty way first, and then see how we can improve it later.
>>> import urllib.request
>>> a_url = 'http://www.w3.org/2005/Atom'
>>> data = urllib.request.urlopen(a_url).read()
>>> type(data)
<class 'bytes'>
>>> print(data)
b'<?xml version="1.0" encoding="utf-8"?>\n <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n\n <html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">\n <head>\n <meta name="generator" content=\n
...
The urllib.request module has a urlopen() function that takes the address of the page we want, and returns a file-like object that we can just read() from to get the full contents of the page.
The urlopen().read() method always returns a bytes object, not a string. Remember, bytes are bytes; characters are an abstraction. http servers don't deal in abstractions. If we request a resource, we get bytes. If we want it as a string, we'll need to determine the character encoding and explicitly convert it to a string.
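For example, if the Content-Type header declares a charset, we can use it to decode the bytes; a small sketch (falling back to utf-8 when the server doesn't say):

from urllib.request import urlopen

response = urlopen('http://www.w3.org/2005/Atom')
data = response.read()                                        # bytes

charset = response.headers.get_content_charset() or 'utf-8'   # fall back if unspecified
text = data.decode(charset)                                   # now a str
print(type(data), type(text))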
We can do whatever we want with the data fetched by urllib.request. However, once we start thinking in terms of a web service that we want to access on a regular basis, we will soon feel the pain. Let's look at other options.
Let's turn on the debugging features of Python's http library and see what's being sent over the network.
>>> from http.client import HTTPConnection
>>> HTTPConnection.debuglevel = 1
>>> from urllib.request import urlopen
>>> response = urlopen('http://www.w3.org/2005/Atom')
send: b'GET /2005/Atom HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.w3.org\r\nConnection: close\r\nUser-Agent: Python-urllib/3.2\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date
header: Server
header: Content-Location
header: Vary
header: TCN
header: Last-Modified
header: ETag
header: Accept-Ranges
header: Content-Length
header: Cache-Control
header: Expires
header: P3P
header: Connection
header: Content-Type
The urllib.request relies on another standard Python library, http.client. Normally we don't need to touch http.client directly since the urllib.request module imports it automatically. But we import it here so we can toggle the debugging flag on the HTTPConnection class that urllib.request uses to connect to the http server.
Now that the debugging flag is set, information on the http request and response is printed out in real time. As we can see, when we request the Atom feed, the urllib.request module sends five lines to the server.
- The first line specifies the http verb we're using and the path of the resource (minus the domain name).
- The Host header specifies the domain name from which we're requesting this feed.
- The Accept-Encoding header lists the compression algorithms the client supports; urllib.request does not request compression by default (identity means uncompressed).
- The User-Agent header specifies the name of the library that is making the request. By default, this is Python-urllib plus a version number. Both urllib.request and httplib2 support changing the user agent, simply by adding a User-Agent header to the request, which will override the default value.
Now let's look at what the server sent back in its response. Continued from the previous example.
>>> print(response.headers.as_string())
Date: Mon, 21 Jan 2013 04:37:06 GMT
Server: Apache/2
Content-Location: Atom.html
Vary: negotiate,Accept-Encoding
TCN: choice
Last-Modified: Sat, 13 Oct 2007 02:19:32 GMT
ETag: "90a-43c56773a3500;4bc4eec228980"
Accept-Ranges: bytes
Content-Length: 2314
Cache-Control: max-age=21600
Expires: Mon, 21 Jan 2013 10:37:06 GMT
P3P: policyref="http://www.w3.org/2001/05/P3P/p3p.xml"
Connection: close
Content-Type: text/html; charset=utf-8

>>> data = response.read()
>>> len(data)
2314
The response returned from the urllib.request.urlopen() function contains all the http headers the server sent back. It also contains methods to download the actual data.
The server tells us when it handled our request. This response includes a Last-Modified header and ETag header.
The data is 2314 bytes long. Notice what isn't here: a Content-encoding header. Our request stated that we only accept uncompressed data (Accept-encoding: identity), and sure enough, this response contains uncompressed data.
This response also includes caching headers stating that this feed can be cached for up to 6 hours (21600 seconds). Finally, we download the actual data by calling response.read(); as the len() function tells us, this fetched a total of 2314 bytes.
As we can see, this code is already inefficient: it asked for (and received) uncompressed data. I know for a fact that this server supports gzip compression, but http compression is opt-in. We didn't ask for it, so we didn't get it. That means we're fetching 2314 bytes when we could have fetched less.
It gets worse! To see just how inefficient this code is, let's request the same feed a second time.
>>> response2 = urlopen('http://www.w3.org/2005/Atom')
send: b'GET /2005/Atom HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.w3.org\r\nConnection: close\r\nUser-Agent: Python-urllib/3.2\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date
header: Server
header: Content-Location
header: Vary
header: TCN
header: Last-Modified
header: ETag
header: Accept-Ranges
header: Content-Length
header: Cache-Control
header: Expires
header: P3P
header: Connection
header: Content-Type
...
Notice that it hasn't changed! It's exactly the same as the first request. No sign of If-Modified-Since headers. No sign of If-None-Match headers. No respect for the caching headers. Still no compression.
>>> import httplib2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    import httplib2
ImportError: No module named httplib2
So, we need to install http://code.google.com/p/httplib2/.
Downloads:
httplib2-0.7.7.tar.gz
httplib2-0.7.7.zip
To install:
$ python setup.py install
To use httplib2, create an instance of the httplib2.Http class.
>>> import httplib2
>>> h = httplib2.Http('.cache')
>>> response, content = h.request('http://www.w3.org/2005/Atom')
>>> response.status
200
>>> content[:52]
b'<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE htm'
>>> len(content)
2314
The primary interface to httplib2 is the Http object. We should always pass a directory name when we create an Http object. The directory does not need to exist; httplib2 will create it if necessary.
Once we have an Http object, retrieving data is as simple as calling the request() method with the address of the data we want. This will issue an http GET request for that url.
The request() method returns two values:
- The first is an httplib2.Response object, which contains all the http headers the server returned. For example, a status code of 200 indicates that the request was successful.
- The content variable contains the actual data that was returned by the http server. The data is returned as a bytes object, not a string. If we want it as a string, we'll need to determine the character encoding and convert it ourselves.
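The request() method also accepts optional method, body, and headers arguments, so the same Http object can issue writes as well as reads. A hedged sketch against a made-up endpoint (api.example.com and /statuses are not real; any service that accepts form data would do):

import httplib2
from urllib.parse import urlencode

h = httplib2.Http('.cache')

# Hypothetical endpoint, shown only to illustrate the request() signature for a POST.
body = urlencode({'status': 'Test update'})
response, content = h.request('http://api.example.com/statuses',
                              method='POST',
                              body=body,
                              headers={'Content-Type': 'application/x-www-form-urlencoded'})
print(response.status)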
To use Caching, we should always create an httplib2.Http object with a directory name.
>>> response2, content2 = h.request('http://www.w3.org/2005/Atom')
>>> response2.status
200
>>> content2[:52]
b'<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE htm'
>>> len(content2)
2314
Nothing unusual; we get the same results as the previous run.
However, if we run it after relaunching Python shell, a surprise will be waiting for us.
>>> import httplib2
>>> httplib2.debuglevel = 1
>>> h = httplib2.Http('.cache')
>>> response, content = h.request('http://www.w3.org/2005/Atom')
>>> len(content)
2314
>>> response.status
200
>>> response.fromcache
True
We turned on debugging. This was the httplib2 equivalent of turning on debugging in http.client. httplib2 will print all the data being sent to the server and some key information being sent back.
We created an httplib2.Http object with the same directory name as before, then requested the same url as before. Or more precisely, nothing got sent to the server, and nothing got returned from the server. There was absolutely no network activity whatsoever.
However, we did receive some data - in fact, we received all of it. We also received an http status code indicating that the request was successful.
Actually, this response was generated from httplib2's local cache. That directory name we passed in when we created the httplib2.Http object - that directory holds httplib2's cache of all the operations it's ever performed.
We previously requested the data at this url. That request was successful (status: 200). That response included not only the feed data, but also a set of caching headers that told anyone who was listening that they could cache this resource for up to 6 hours (Cache-Control: max-age=21600, which is 6 hours measured in seconds). httplib2 understands and respects those caching headers, and it stored the previous response in the .cache directory (which we passed in when we created the Http object). That cache hasn't expired yet, so the second time we request the data at this url, httplib2 simply returns the cached result without ever hitting the network.
httplib2 handles http caching automatically and by default.
Now, suppose we have data cached, but we want to bypass the cache and re-request it from the remote server. Browsers sometimes do this if the user specifically requests it. For example, pressing F5 refreshes the current page, but pressing Ctrl+F5 bypasses the cache and re-requests the current page from the remote server. We might think oh, I'll just delete the data from my local cache, then request it again. We could do that, but remember that there may be more parties involved than just us and the remote server. What about those intermediate proxy servers? They're completely beyond our control, and they may still have that data cached, and will happily return it to us because (as far as they are concerned) their cache is still valid.
Instead of manipulating our local cache and hoping for the best, we should use the features of http to ensure that our request actually reaches the remote server.
>>> import httplib2
>>> response2, content2 = h.request('http://www.w3.org/2005/Atom', headers={'cache-control':'no-cache'})
send: b'GET /2005/Atom HTTP/1.1\r\nHost: www.w3.org\r\nuser-agent: Python-httplib2/0.7.7 (gzip)\r\naccept-encoding: gzip, deflate\r\ncache-control: no-cache\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date
header: Server
header: ...
>>> response2.status
200
>>> response2.fromcache
False
>>> print(dict(response2.items()))
{'status': '200',
 'content-length': '2314',
 'content-location': 'Atom.html',
 'accept-ranges': 'bytes',
 'expires': 'Mon, 21 Jan 2013 12:55:46 GMT',
 'vary': 'negotiate,Accept-Encoding',
 'server': 'Apache/2',
 'tcn': 'choice',
 'last-modified': 'Sat, 13 Oct 2007 02:19:32 GMT',
 'connection': 'close',
 '-content-encoding': 'gzip',
 'etag': '"90a-43c56773a3500;4bc4eec228980"',
 'cache-control': 'max-age=21600',
 'date': 'Mon, 21 Jan 2013 06:55:46 GMT',
 'p3p': 'policyref="http://www.w3.org/2001/05/P3P/p3p.xml"',
 'content-type': 'text/html; charset=utf-8'}
httplib2 allows us to add arbitrary http headers to any outgoing request. In order to bypass all caches (not just our local disk cache, but also any caching proxies between us and the remote server), add a no-cache header in the headers dictionary.
Now we see httplib2 initiating a network request. httplib2 understands and respects caching headers in both directions - as part of the incoming response and as part of the outgoing request. It noticed that we added the no-cache header, so it bypassed its local cache altogether and then had no choice but to hit the network to request the data.
This response was not generated from our local cache. We knew that, of course, because we saw the debugging information on the outgoing request. But it's nice to have that programmatically verified.
The request succeeded; we downloaded the entire feed again from the remote server. Of course, the server also sent back a full complement of http headers along with the feed data. That includes caching headers, which httplib2 uses to update its local cache, in the hopes of avoiding network access the next time we request this feed. Everything about http caching is designed to maximize cache hits and minimize network access. Even though we bypassed the cache this time, the remote server would really appreciate it if we would cache the result for next time.
The Cache-Control and Expires caching headers are called freshness indicators. They tell caches in no uncertain terms that we can completely avoid all network access until the cache expires. Given a freshness indicator, httplib2 does not generate a single byte of network activity to serve up cached data unless we explicitly bypass the cache.
But what about the case where the data might have changed, but hasn't? http defines Last-Modified and Etag headers for this purpose. These headers are called validators. If the local cache is no longer fresh, a client can send the validators with the next request to see if the data has actually changed. If the data hasn't changed, the server sends back a 304 status code and no data. So there's still a round-trip over the network, but we end up downloading fewer bytes.
This time, instead of the feed, we're going to download the site's home page, which is html. Since this is the first time we've ever requested this page, httplib2 has little to work with, and it sends out a minimum of headers with the request. The response contains a multitude of http headers but no caching information. However, it does include both an ETag and Last-Modified header:
>>> import httplib2
>>> httplib2.debuglevel = 1
>>> h = httplib2.Http('.cache')
>>> response, content = h.request('http://www.w3.org/')
send: b'GET / HTTP/1.1\r\nHost: www.w3.org\r\naccept-encoding: gzip, deflate\r\nuser-agent: Python-httplib2/0.7.7 (gzip)\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: ...
>>> print(dict(response.items()))
{'status': '200',
 'content-length': '32868',
 'content-location': 'Home.html',
 'accept-ranges': 'bytes',
 'expires': 'Tue, 22 Jan 2013 05:06:02 GMT',
 'vary': 'negotiate,accept',
 'server': 'Apache/2',
 'tcn': 'choice',
 'last-modified': 'Mon, 21 Jan 2013 23:07:22 GMT',
 'connection': 'close',
 'etag': '"8013-4d3d486075e80;89-3f26bd17a2f00"',
 'cache-control': 'max-age=600',
 'date': 'Tue, 22 Jan 2013 05:07:40 GMT',
 'p3p': 'policyref="http://www.w3.org/2001/05/P3P/p3p.xml"',
 'content-type': 'text/html; charset=utf-8'}
>>> len(content)
32868
Now, we're going to request the same page again, with the same Http object and the same local cache.
>>> response, content = h.request('http://www.w3.org/')
>>> response.fromcache
True
Now recall the request we made to http://www.w3.org/ with debugging turned on:

>>> import httplib2
>>> httplib2.debuglevel = 1
>>> h = httplib2.Http('.cache')
>>> response, content = h.request('http://www.w3.org/')
send: b'GET / HTTP/1.1\r\nHost: www.w3.org\r\naccept-encoding: gzip, deflate\r\nuser-agent: Python-httplib2/0.7.7 (gzip)\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: ...
>>> print(dict(response.items()))
{'status': '200',
 'content-length': '32868',
 'content-location': 'Home.html',
 'accept-ranges': 'bytes',
 'expires': 'Tue, 22 Jan 2013 05:06:02 GMT',
 'vary': 'negotiate,accept',
 'server': 'Apache/2',
 'tcn': 'choice',
 'last-modified': 'Mon, 21 Jan 2013 23:07:22 GMT',
 'connection': 'close',
 'etag': '"8013-4d3d486075e80;89-3f26bd17a2f00"',
 'cache-control': 'max-age=600',
 'date': 'Tue, 22 Jan 2013 05:07:40 GMT',
 'p3p': 'policyref="http://www.w3.org/2001/05/P3P/p3p.xml"',
 'content-type': 'text/html; charset=utf-8'}
Every time httplib2 sends a request, it includes an Accept-Encoding header to tell the server that it can handle either deflate or gzip compression. In this case the truncated debug output doesn't show whether the server actually compressed the payload, but httplib2 records that information for us. By the time the request() method returns, httplib2 has already decompressed the body of the response and placed it in the content variable.
Since the encoding of the payload isn't obvious from the output above, we can check it using response['-content-encoding']:
>>> response['-content-encoding']
'gzip'
Since I do not have control over the site http://www.w3.org/, I'll use my site and put a redirect there.
"http://www.bogotobogo.com/python/python_http_web_services_redirect.php"
<?php
header('HTTP/1.1 301 Moved Permanently');
header('Location: http://www.bogotobogo.com/python/python_http_web_services.php');
exit();
?>
It will be redirected to "http://www.bogotobogo.com/python/python_http_web_services.php", which is the page we're at now.
Let's do the same thing as before:
>>> import httplib2
>>> httplib2.debuglevel = 1
>>> h = httplib2.Http('.cache')
>>> response, content = h.request('http://www.bogotobogo.com/python/python_http_web_services_redirect.php')
send: b'GET /python/python_http_web_services_redirect.php HTTP/1.1\r\nHost: www.bogotobogo.com\r\naccept-encoding: gzip, deflate\r\nuser-agent: Python-httplib2/0.7.7 (gzip)\r\n\r\n'
reply: ''
send: b'GET /python/python_http_web_services_redirect.php HTTP/1.1\r\nHost: www.bogotobogo.com\r\naccept-encoding: gzip, deflate\r\nuser-agent: Python-httplib2/0.7.7 (gzip)\r\n\r\n'
reply: 'HTTP/1.1 301 Moved Permanently\r\n'
...
send: b'GET /python/python_http_web_services.php HTTP/1.1\r\nHost: www.bogotobogo.com\r\naccept-encoding: gzip, deflate\r\nuser-agent: Python-httplib2/0.7.7 (gzip)\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
...
And the response we got from the final url:
>>> response
{'status': '200',
 'content-length': '53411',
 'content-location': 'http://www.bogotobogo.com/python/python_http_web_services.php',
 'vary': 'Accept-Encoding',
 'server': 'Apache',
 '-content-encoding': 'gzip',
 'date': 'Tue, 22 Jan 2013 06:00:17 GMT',
 'content-type': 'text/html'}
If we want more information about the intermediate url:
>>> response.previous
{'status': '301',
 'content-length': '0',
 'content-location': 'http://www.bogotobogo.com/python/python_http_web_services_redirect.php',
 'vary': 'Accept-Encoding',
 'server': 'Apache',
 '-content-encoding': 'gzip',
 'location': 'http://www.bogotobogo.com/python/python_http_web_services.php',
 'date': 'Tue, 22 Jan 2013 06:00:17 GMT',
 '-x-permanent-redirect-url': 'http://www.bogotobogo.com/python/python_http_web_services.php',
 'content-type': 'text/html'}
If we request the same page again, there will be no request to the intermediate url; httplib2 remembers the permanent redirect and goes straight to the final url.
>>> response2, content2 = h.request('http://www.bogotobogo.com/python/python_http_web_services_redirect.php')
send: b'GET /python/python_http_web_services.php HTTP/1.1\r\nHost: www.bogotobogo.com\r\naccept-encoding: gzip, deflate\r\nuser-agent: Python-httplib2/0.7.7 (gzip)\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
...