tracking without cookies

$Id: tracking-without-cookies.html,v 1.4 2003/02/17 23:38:19 dean Exp $

years ago i wrote a document about tracking with cookies. i included a section about tracking without cookies, which is now out of date. i'm too lazy to describe each of these techniques in detail, but i'll outline several methods which can be used to track without cookies.

first off let's classify three uses of cookies:

  • to customize a page with something like "Hello luser1234,". i'll refer to this use as page-customization. (page-customization is distinct from log analysis because of the real-time nature of page generation vs. the offline nature of log analysis.)
  • to track user activity within a single browser "session". i'll refer to this use as intra-session tracking.
  • to track user activity across browser "sessions" (i.e. across reboots of a computer, or restarts of a browser). i'll refer to this use as inter-session tracking.

remember that none of these techniques is perfect -- but in concert they provide a high degree of certainty that a set of accesses comes from a particular user (or small group of users).

URL munging (page-customization, intra-session tracking)

we've all seen godawfully ugly urls which contain crud which a human can't be expected to parse. it's trivial to include a tracking token (the same value you'd otherwise put in a cookie) in every URL the server emits, which allows for intra-session tracking and page-customization.
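here's a minimal sketch of URL munging, assuming the token rides in a query parameter. the parameter name `sid` and the helper names `munge_url` and `extract_token` are made up for illustration; a real site might bury the token in the path instead.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def munge_url(url, token, param="sid"):
    """Return url with the tracking token added as a query parameter."""
    parts = urlsplit(url)
    # drop any stale copy of the parameter, keep everything else as-is
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != param]
    query.append((param, token))
    return urlunsplit(parts._replace(query=urlencode(query)))


def extract_token(url, param="sid"):
    """Return the tracking token carried by url, or None if there is none."""
    for key, value in parse_qsl(urlsplit(url).query, keep_blank_values=True):
        if key == param:
            return value
    return None
```

the server would run every link in a generated page through `munge_url`, and call `extract_token` on each incoming request to tie it back to the same visitor.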

keepalive (intra-session tracking)

clients tend not to close HTTP keepalive connections on their own (until the user wanders off to another website) -- it's usually the server that times them out and shuts them down. but a properly architected server could hold a connection open indefinitely, and every request arriving on that connection comes from the same client. this provides a method for intra-session tracking. (additionally, since the client will eventually close the connection after the user has wandered off to another website, this provides a method of knowing how long a user has been reading your pages.)
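a rough sketch of the bookkeeping, assuming the server logs the client's address and source port with each request (a keepalive connection keeps the same pair for its whole life). the record layout and the names `group_by_connection` and `reading_time` are assumptions, and a real log spanning days would also have to cope with source ports being reused.

```python
from collections import defaultdict


def group_by_connection(requests):
    """Group (timestamp, client_ip, client_port, path) records by connection."""
    conns = defaultdict(list)
    for ts, ip, port, path in requests:
        # same (ip, port) pair means same TCP connection, hence same client
        conns[(ip, port)].append((ts, path))
    return dict(conns)


def reading_time(conn_requests, closed_at):
    """Seconds between the first request and the client closing the connection."""
    return closed_at - conn_requests[0][0]
```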

SSL session ID (intra-session tracking)

public key operations are computationally expensive, and so the TLS/SSL protocols provide a mechanism whereby the server and client can agree on a session ID, which the client presents again on later connections so the expensive handshake can be skipped. the server can use that session ID to tie together all requests within a client session. it's now economically feasible (or it will be soon enough) to encrypt all traffic to a website.

this has the added irony that the user will see the nice little lock icon in their browser and feel all secure and snuggly, without realising their browser is happily giving away intra-session hints.

TCP timestamps, and IP id (intra-session tracking)

all IP packets include a 16-bit id field which most hosts just increment from one packet to the next. this might be used to associate requests from a single client -- and has the added benefit that popular NATs don't touch this field. a paper has been published describing the use of this technique for counting hosts behind a NAT.

the RFC1323 TCP timestamp field is similarly predictable, and unmodified by popular NATs.
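to make the IP id idea concrete, here's a toy sketch that splits a stream of observed id values into per-host sequences. it assumes each host increments its own counter by a small amount per packet; the greedy matching, the `max_gap` threshold and the name `split_by_ip_id` are my own illustration, not a description of any published algorithm.

```python
def split_by_ip_id(ip_ids, max_gap=64):
    """Greedily split observed 16-bit IP ids into per-host sequences."""
    groups = []  # one list of ids per inferred host
    for ip_id in ip_ids:
        best, best_gap = None, max_gap + 1
        for group in groups:
            # distance forward from the group's last id, allowing 16-bit wraparound
            gap = (ip_id - group[-1]) % 65536
            if 0 < gap < best_gap:
                best, best_gap = group, gap
        if best is None:
            groups.append([ip_id])  # nothing close enough: assume a new host
        else:
            best.append(ip_id)
    return groups
```

the number of groups is the estimated number of hosts behind the NAT, and each group is a set of packets (and therefore requests) attributed to one of them.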

Last-Modified and ETag (inter-session tracking)

other than cookies, there's typically only one other type of data a webserver can cause a browser to store on its local harddrive -- cacheable web content. this technique attempts to get the browser to store unique id information in its cache in a manner which will be communicated to the server at a later date. (the later communication will be via a GET If-Modified-Since, or If-None-Match.)

Last-Modified timestamps can be selected from a range of, say, 10 million seconds near the year 2038 (the end of time in current 32-bit unix time_t). with one user per second of that range, this allows for tracking 10 million users. it's worth noting that 10 million seconds is only about 116 days... and combined with other techniques you could easily get another order of magnitude out of this representation without using up much of the time_t range.
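a small sketch of the encoding, assuming one user id per second in the last 10 million seconds of a signed 32-bit time_t. the constant and function names are invented for the example.

```python
from email.utils import formatdate, parsedate_to_datetime

TIME_T_MAX = 2**31 - 1        # last second representable in a signed 32-bit time_t
RANGE = 10_000_000            # 10 million seconds, i.e. RANGE / 86400 ~= 115.7 days
BASE = TIME_T_MAX - RANGE     # first timestamp used for tracking


def id_to_last_modified(user_id):
    """Encode a user id as an HTTP date suitable for a Last-Modified header."""
    if not 0 <= user_id < RANGE:
        raise ValueError("user id out of range")
    return formatdate(BASE + user_id, usegmt=True)


def last_modified_to_id(header_value):
    """Decode an If-Modified-Since value back to a user id, or None if it isn't ours."""
    timestamp = int(parsedate_to_datetime(header_value).timestamp())
    offset = timestamp - BASE
    return offset if 0 <= offset < RANGE else None
```

the server sends `id_to_last_modified(n)` with the response, and later feeds whatever arrives in If-Modified-Since to `last_modified_to_id`.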

ETags have completely arbitrary content, and don't have the limitations that Last-Modified timestamps do. newer browsers tend to implement ETags.

to use this technique simply include a reference to a 1x1 transparent gif somewhere in your page, and then combine it with one of the other intra-session techniques. you'll be able to recognize browsers across sessions by studying the timestamp and/or ETag they include in their requests.
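the ETag flavour of this can be sketched as a single request handler for the gif. everything here is an assumption made for illustration: headers arrive as a plain dict with canonical names, the id is a random hex string, and weak validators (`W/"..."`) are ignored.

```python
import secrets


def handle_pixel(request_headers, make_id=lambda: secrets.token_hex(8)):
    """Return (status, response_headers, client_id) for a request for the gif."""
    tag = request_headers.get("If-None-Match")
    if tag:
        # returning visitor: the browser is handing our id back to us
        client_id = tag.strip().strip('"')
        return 304, {"ETag": f'"{client_id}"'}, client_id
    # first visit: mint an id and let the browser cache it as the ETag
    client_id = make_id()
    headers = {"ETag": f'"{client_id}"', "Content-Type": "image/gif"}
    return 200, headers, client_id
```

the 304 keeps the cached copy (and its ETag) alive, so the same id should come back on the next visit for as long as the gif stays in the browser's cache.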

Indirect Last-Modified and ETag (inter-session tracking)

a variation on the previous method -- since many "privacy filters" will thwart 1x1 transparent gifs.

suppose your web page A includes a reference to a frameset or style-sheet B. the frameset or style-sheet B includes a URL munged reference to another resource (such as a background gif, or subframe) C.

when your server sees a request for B, respond in one of two manners:

  • if the request does not contain an If-Modified-Since or If-None-Match then construct a response which contains a URL munged reference to C; and include an indefinite Last-Modified and/or an ETag.
  • if the request contains an If-Modified-Since or If-None-Match respond with 304 (Not Modified).

every time the browser starts up it will revalidate B, get a 304, and then request the URL munged C named in its cached copy of B -- and because we've carefully arranged for a private copy of B for each client we can use the URL munging in C to track the client across sessions. combine with one of the intra-session methods for a complete solution.
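the two cases for B can be sketched like this, assuming B is a style-sheet and C is a background gif. the path `/c.gif`, the `sid` parameter, the fixed Last-Modified date and the name `handle_b` are all invented for the example.

```python
from email.utils import formatdate

# an arbitrary fixed date; all that matters is that the browser stores it
FIXED_LAST_MODIFIED = formatdate(946684800, usegmt=True)


def handle_b(request_headers, new_token):
    """Return (status, headers, body) for a request for style-sheet B."""
    if ("If-Modified-Since" in request_headers
            or "If-None-Match" in request_headers):
        # the browser already holds its private copy of B: keep it in place
        return 304, {}, b""
    # first visit: hand out a copy of B whose reference to C is unique
    body = (
        'body { background-image: url("/c.gif?sid=%s"); }\n' % new_token
    ).encode("ascii")
    headers = {
        "Content-Type": "text/css",
        "Last-Modified": FIXED_LAST_MODIFIED,
        "ETag": '"%s"' % new_token,
    }
    return 200, headers, body
```

the tracking then happens in the handler for C, which only has to read the `sid` value out of the requested URL.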

Javascript (intra-session tracking, page-customization)

i don't do javascript, but i'm pretty sure it can be used to do intra-session tracking and page-customization. in particular i believe it's possible to hide URL munging with javascript.

Statistical IP analysis (intra-session tracking)

other than in the case of proxies and NAT, an IP address tends to be "sticky" to one user for a short window of time. and even if it's a dynamic dialup address, unless you're a hugely popular website, you won't tend to see distinct users from the same dynamic address within even a day of logs.
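as a rough illustration, here's how log entries might be grouped into per-address sessions offline. the 30 minute idle timeout, the record layout and the name `sessionize` are assumptions; pick whatever window matches how "sticky" addresses are for your visitors.

```python
def sessionize(entries, idle_timeout=1800):
    """Split (timestamp, ip, path) log entries, sorted by time, into sessions.

    Returns a list of (ip, [(timestamp, path), ...]) in order of first request.
    """
    sessions = []
    open_by_ip = {}  # ip -> request list of that address's most recent session
    for ts, ip, path in entries:
        current = open_by_ip.get(ip)
        # start a new session on first sight or after a long enough silence
        if current is None or ts - current[-1][0] > idle_timeout:
            current = []
            sessions.append((ip, current))
            open_by_ip[ip] = current
        current.append((ts, path))
    return sessions
```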

<base href>

perhaps it's possible to use <base href> for tracking -- i haven't investigated this any further.

proxies, NAT

this is where i'd study the effect of various proxies and NATs on the above techniques, except that i'm too lazy. i feel confident that there are techniques which communicate per-client information even through proxies (SSL in particular is one such method -- combine it with some of the others for a full solution).

i would also need to consider the AOL proxy "spray" effect where the same client's requests are spread over multiple proxies.