The truth about hits (and misses)

A Page on the Web, published in the Solicitors Journal, November 2000

I hope I will be forgiven for straying this month into territory that some may regard as too technical.

Web usage statistics cannot be used to make any inferences about the number of people who have read pages on a site or even the number of pages read. Although those who compile them usually try to make this clear, people still insist on using them.

So if any reliance at all is to be placed on these statistics, it is as well to be informed about how they are compiled and reported.

Requests

A request (commonly called a ‘hit’) occurs when a web server is asked to provide a page, graphic or other object. Requests may be generated either by a visitor going to a page or by the browser then fetching an object embedded in that page (usually a graphic). Using the number of requests to gauge the popularity of a site is misleading because pages with many graphics generate many more requests than pages with few.
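
For those who like to see the arithmetic, here is a minimal sketch in Python. The two pages and their graphics are invented for illustration; nothing here comes from a real log.

```python
# Hypothetical pages: each maps to the embedded objects a browser must also fetch.
PAGES = {
    "/plain.html": ["/logo.gif"],
    "/glossy.html": [f"/img/banner{i}.gif" for i in range(10)],
}


def requests_for_view(page):
    """Return every request ('hit') that one view of `page` produces."""
    return [page] + PAGES[page]


def count_hits(views):
    """Total requests generated by a sequence of page views."""
    return sum(len(requests_for_view(page)) for page in views)


# The same number of readers, very different hit counts.
plain_hits = count_hits(["/plain.html"] * 100)
glossy_hits = count_hits(["/glossy.html"] * 100)
print(plain_hits, glossy_hits)
```

One hundred views of each page yield 200 hits for the plain page and 1,100 for the glossy one, although the readership is identical.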

Page requests

A page request (or page view) is a request for an HTML or text page. This is generally accepted as a better guide to usage than requests. However, the number of pages viewed will be heavily influenced by the architecture of a site. To take the simplest example, a user reading an entire document which is split into 4 pages will request 4 pages. The same document on another site may require only one page view. Amongst other arguments against reliance on page views is that a site with bad navigation will generate more page views because the user gets lost!
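
A rough sketch of how page views might be separated from other requests follows. The list of file suffixes that count as a "page" is my assumption, as are the example paths; real reporting software will have its own rules.

```python
# Assumption: a "page" is any request whose path ends in one of these suffixes.
PAGE_SUFFIXES = (".html", ".htm", ".txt")


def is_page_request(path):
    """True if the requested path looks like an HTML or text page."""
    return path.lower().endswith(PAGE_SUFFIXES)


def count_page_views(paths):
    """Count only the requests that are for pages, ignoring graphics and the like."""
    return sum(1 for path in paths if is_page_request(path))


# The same document read in full on two hypothetical sites.
split_site = [
    "/guide/part1.html", "/img/a.gif",
    "/guide/part2.html", "/guide/part3.html", "/guide/part4.html",
]
single_site = ["/guide.html", "/img/a.gif"]

split_views = count_page_views(split_site)
single_views = count_page_views(single_site)
print(split_views, single_views)
```

One reader, one document: four page views on the first site and one on the second.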

Visitors

A visitor is usually defined simply as a unique IP address. A particular IP address may represent a single person or organisation, but often it is shared by many people. If a site uses persistent cookies to identify people, reports can calculate visitors based on a combination of unique IP addresses and cookies.
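
The difference between the two counting methods can be sketched as follows. The rule used here (prefer the cookie where there is one, otherwise fall back on the IP address) is only one plausible way of combining the two, and the addresses and cookie values are made up.

```python
def count_visitors(log_entries):
    """Count distinct visitors from (ip, cookie) pairs.

    Assumed rule: a cookie identifies the visitor if present;
    otherwise the IP address is all we have to go on.
    """
    keys = set()
    for ip, cookie in log_entries:
        keys.add(("cookie", cookie) if cookie else ("ip", ip))
    return len(keys)


# Three people in one office, all reaching the web through a single shared address.
without_cookies = [("192.0.2.1", None), ("192.0.2.1", None), ("192.0.2.1", None)]
with_cookies = [("192.0.2.1", "a"), ("192.0.2.1", "b"), ("192.0.2.1", "c")]

by_ip_only = count_visitors(without_cookies)
by_cookie = count_visitors(with_cookies)
print(by_ip_only, by_cookie)
```

Counted by IP address alone the three colleagues are one visitor; with cookies they are three.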

Visits

A visit (or session) is a collection of requests by a particular visitor at one time. Visits are just estimates because there is no way of knowing that a series of requests actually belongs to the same person, or, for that matter, to the same person during the same visit. Reporting software calculates visits based on several factors including IP addresses, cookies and the delay between consecutive requests. Since the length of delay that is taken to mark the start of a new visit is set by the system administrator, visits cannot be used as a comparator between sites.
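
To show how much turns on that one setting, here is a simplified sketch of a visit counter. It assumes each request has already been attributed to a visitor and that a new visit starts whenever the gap since that visitor's previous request exceeds a timeout; the function name and the sample data are mine.

```python
def count_visits(hits, timeout_seconds):
    """Count visits from (visitor, timestamp_in_seconds) pairs.

    A request opens a new visit if the visitor has not been seen before,
    or was last seen more than `timeout_seconds` ago.
    """
    last_seen = {}
    visits = 0
    for visitor, timestamp in sorted(hits, key=lambda hit: hit[1]):
        previous = last_seen.get(visitor)
        if previous is None or timestamp - previous > timeout_seconds:
            visits += 1
        last_seen[visitor] = timestamp
    return visits


# Identical traffic, reported under three different timeout settings.
hits = [("A", 0), ("A", 600), ("A", 2000), ("B", 100)]
visits_30_min = count_visits(hits, 1800)
visits_15_min = count_visits(hits, 900)
visits_5_min = count_visits(hits, 300)
print(visits_30_min, visits_15_min, visits_5_min)
```

The same four requests are reported as two, three or four visits depending purely on the timeout chosen.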

Caching – the non-visit

The goal of system administrators is to protect web servers from high loads while optimising the speed and reliability of documents served. This is achieved by caching pages. A cache is a copy of pages recently accessed, stored in such a way as to optimise retrieval. Cached pages will, if appropriate, be served up instead of the actual page on the host server, as follows:

Your browser will first look in its own memory.
It will then look in the browser cache on your hard disk.
It may then look in a site cache. If another user on the same site recently retrieved the page, it may be available to you there.
The site cache may be configured to look in a local regional cache.
The local regional cache may be configured to look in a large regional cache.
If a page is not found in any of the above caches, the page will be requested from the host server, which will first look in its own cache (or accelerator).

Only after failing to be satisfied by all the above caches will a page be requested from its source.
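
The sequence above can be sketched in a few lines. This is a toy model, not a description of any real browser or proxy: each cache is a simple dictionary, nothing ever expires, the host server's own accelerator is folded into the Origin class, and all the names are hypothetical.

```python
class Origin:
    """The host server: the only place where a request is logged."""

    def __init__(self, pages):
        self.pages = pages
        self.logged_requests = 0

    def fetch(self, url):
        self.logged_requests += 1
        return self.pages[url]


def fetch_via_caches(url, caches, origin):
    """Try each cache from nearest to farthest; ask the origin only if all miss.

    `caches` is an ordered list of (name, dict) pairs. Returns (where_found, body).
    """
    for index, (name, store) in enumerate(caches):
        if url in store:
            body = store[url]
            for _, nearer in caches[:index]:
                nearer[url] = body  # copy into the caches nearer to the reader
            return name, body
    body = origin.fetch(url)
    for _, store in caches:
        store[url] = body  # every cache on the route keeps a copy
    return "origin", body


origin = Origin({"/index.html": "<html>...</html>"})
shared = [("site cache", {}), ("local regional cache", {}), ("large regional cache", {})]
first_user = [("browser memory", {}), ("browser disk cache", {})] + shared
second_user = [("browser memory", {}), ("browser disk cache", {})] + shared

first = fetch_via_caches("/index.html", first_user, origin)
again = fetch_via_caches("/index.html", first_user, origin)
colleague = fetch_via_caches("/index.html", second_user, origin)
print(first[0], again[0], colleague[0], origin.logged_requests)
```

Three reads of the page by two people produce a single entry in the host server's log: the repeat read is served from browser memory and the colleague's from the site cache.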

It will be seen, therefore, that most requests for web pages, particularly popular pages, are satisfied by cached copies. Statistics of pages retrieved from caches other than a host server’s own accelerator will not be available to the host server and consequently cannot be reported.
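
The scale of the undercount can be illustrated with an equally crude simulation. It assumes every reader has a browser cache, every office has a site cache, and nothing ever expires from either; the numbers are invented and real traffic will be messier.

```python
PAGE = "/popular.html"  # hypothetical page


def simulate(sites, readers_per_site, reads_per_reader):
    """Return (actual_reads, requests_seen_by_host_server) for one popular page."""
    actual_reads = 0
    origin_log = 0
    for _ in range(sites):
        site_cache = set()
        for _ in range(readers_per_site):
            browser_cache = set()
            for _ in range(reads_per_reader):
                actual_reads += 1
                if PAGE in browser_cache:
                    continue  # served from the reader's own browser
                browser_cache.add(PAGE)
                if PAGE in site_cache:
                    continue  # served from the office's shared cache
                site_cache.add(PAGE)
                origin_log += 1  # only now does the host server see a request
    return actual_reads, origin_log


actual, logged = simulate(sites=5, readers_per_site=20, reads_per_reader=3)
print(actual, logged)
```

On these assumptions 300 readings of the page appear in the host server's statistics as 5 requests.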

Site statistics are almost meaningless as a measure of usage. Q.E.D.
