In TV crime dramas the plot is often propelled forward by the discovery of DNA evidence, which is “fingerprinted” instantly, leading to the identification of an individual at the scene of the crime. When available, a DNA fingerprint is a very strong indicator of identity, and the drama may then focus not on who was there, but on why and how the DNA got there. Perhaps the DNA donor played an innocent role at the crime scene, or perhaps the evidence was contaminated while in the hands of the police. Or perhaps the perpetrator was careful, wearing gloves and wiping down surfaces, or even faked the evidence, planting DNA to frame an innocent party.
Now, when the TV cops don’t have a strong indicator, like DNA or even an actual fingerprint, they may have to rely on combinations of weaker indicators of identity. The footprint in the rose bed outside the study window, the brand of the cigarette stub in the ashtray, the tyre tracks from the getaway car, the fuzzy image from the security cam. Put enough of these weak identifiers together and you might just have enough to pinpoint a suspect, and bring them in to help with enquiries: you might even get a confession.
Well, I might be familiar with the cliches, but I’m not about to start writing teleplays for crime dramas. However, we have been doing a little detective work of our own:
Echo Protect
The Echo Protect project is about detecting potential fraud: spotting suspicious behaviour from visitors to clients’ websites, and from staff using their internal systems. It’s concerned with patterns of behaviour, and with early detection of suspicious activity. One of the challenges facing the project is the reliable and speedy identification of website visitors and users. In other words, when someone hits a client’s website, we want to be able to quickly detect if this is a known visitor, and in particular, if there is any cause for concern. If red flags are raised, Echo Protect can help the client intervene in some way, and deflect the user from any intended fraud, before the fraud has been committed. So we’re looking for some way to recognise if we’ve seen a visitor before.
Strong Indicators
There are some strong indicators that might be available for us to use, the equivalent of DNA or actual fingerprints. If the user actually signs in with a username and password, then that supplies a strong indicator of identity. (But strong is not the same as reliable: for example, the password might have been hacked.) There are other strong indicators:
- A MAC address is a strong identifier of a physical device (but it’s generally not available without user consent/involvement)
- Any cookies used by the website are strong identifiers of browser/device (but a user can refuse cookies)
- Browsers can be fingerprinted without using cookies, and provide surprisingly strong indicators of browser/device – up to 94% of browsers can be uniquely fingerprinted. But these fingerprints are computed using factors that may change over time (eg browser version)
- IP addresses are strong identifiers of internet connections (but are generally not long-lasting)
All these are extremely helpful in the right circumstances, but all have their limitations.
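To illustrate why a cookieless fingerprint is both strong and fragile, here is a minimal sketch that hashes a handful of browser attributes into one identifier. The attribute names and values are invented for illustration, and this is not a description of how Echo Protect or any real fingerprinting library computes its fingerprints.

```python
import hashlib
import json


def fingerprint(attributes):
    """Hash a dict of browser attributes into a short hex identifier."""
    # Serialise with sorted keys so the same attributes always give the same hash.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical attribute set; a real fingerprint would draw on many more factors.
browser = {
    "user_agent": "ExampleBrowser/1.0",
    "screen": "1920x1080",
    "timezone": "Europe/London",
    "fonts": ["Arial", "Verdana"],
}

before = fingerprint(browser)
# A routine browser upgrade changes one factor, and with it the whole fingerprint.
after_upgrade = fingerprint({**browser, "user_agent": "ExampleBrowser/1.1"})

print(before, after_upgrade, before == after_upgrade)
```

A hash like this is all-or-nothing: a single changed factor yields a completely different value, which is one reason a fingerprint on its own may lose track of a returning visitor.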
BUT: even if we have 100% identification of, say, the device connecting to the website, that’s not the end of the story. The fraudster might be using multiple devices and the patterns of fraudulent behaviour might be visible only when the activities of multiple devices are correlated. So it makes sense to rely, not just on one perfect indicator, but on multiple indicators. When we see multiple indicators all pointing in the same direction, the confidence in our identification soars.
We have been taking data from an e-retailer whose website receives visitors from all around the world (most of whom do not sign in, but merely visit), and we’ve been experimenting with different routes to positive identification. We do have some strong indicators: a locally stored token (equivalent to a cookie), and a browser fingerprint. We are reasonably sure that when we see these indicators, they point pretty conclusively to a particular browser on a particular machine. But these strong indicators naturally do NOT help us when one user is using multiple devices, say, a tablet and a laptop. So we are starting to look at combinations of weaker factors: footprints and tyre tracks rather than fingerprints.
Weak Indicators
For example, let’s consider the property “city” as returned by an IP-based location service. On the face of it, this is a pretty weak identifier. The website in question gets traffic from all around the world, but most of the traffic is from the UK. To see a user visiting from London is next to useless in terms of positive identification. But (to take a real example), what if the city is Belo Horizonte in Brazil? For this website, that is an unusual city. To be sure, Belo Horizonte is a city of 2.5m people and for a Brazilian website, or for a website with a lot more traffic, this would be unexceptional. But in our case, multiple accesses from Belo Horizonte would at least suggest a tentative identification, something worth noting. It seems the strength of “city” as an indicator depends on the particular value it carries.
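One way to put a number on "how unusual is this value" is to score each value by its rarity in the site's own traffic. The sketch below uses self-information (the negative log of the observed frequency, in bits). That measure is our choice for illustration rather than something the project is known to use, and the visit counts are made up.

```python
import math
from collections import Counter


def indicator_strength(values):
    """Score each distinct value by its rarity: -log2(frequency), in bits."""
    counts = Counter(values)
    total = len(values)
    return {value: -math.log2(n / total) for value, n in counts.items()}


# Hypothetical visit log for a mostly-UK website (counts are invented).
cities = (
    ["London"] * 60
    + ["Manchester"] * 30
    + ["Chichester"] * 8
    + ["Belo Horizonte"] * 2
)

strength = indicator_strength(cities)
# Print the weakest values first and the strongest last.
for city, bits in sorted(strength.items(), key=lambda item: item[1]):
    print(f"{city:15s} {bits:5.2f} bits")
```

With these numbers London scores well under one bit while Belo Horizonte scores several, so the same property is a weak clue for one visitor and a fairly strong one for another.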
We’ve been looking at other “weak” indicators. For example:
- the network through which a visitor is connected (eg BT, Sky, etc);
- the sort of activity the user gets up to when connected (browsing, buying, home-page only, etc);
- the type of device being used (iPad, laptop, Android tablet, etc).
Combinations
Sometimes these indicators can prove to be fairly strong in their own right (sorry Google, but “Chromebook” is much rarer than “iPad”, and so is a much stronger indicator). Mostly, though, it is combinations of these indicators that build up a sufficiently strong identification to be useful.
For example, in our dataset, the combination of “city=chichester” and a particular network provider is enough to establish uniqueness. Neither factor on its own is strong, but taken together they are strong enough to suggest we have identified a unique individual.
You may know about Googlewhacking: the idle pursuit of typing words into Google search and trying to find combinations that yield just one result. What we’re doing is a bit like that, trying to see which of the data values we hold, and in which combinations, will find the unique visitor.
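The search for identifying combinations can be sketched as a brute-force count: for every set of fields, tally how many records share each tuple of values and keep the tuples seen exactly once. The records and field names below are invented, and the sketch assumes each record is a distinct visitor.

```python
from collections import Counter
from itertools import combinations

# Hypothetical visitor records; one record per visitor.
visitors = [
    {"city": "london", "network": "BT", "device": "iPad"},
    {"city": "london", "network": "Sky", "device": "laptop"},
    {"city": "london", "network": "BT", "device": "laptop"},
    {"city": "chichester", "network": "BT", "device": "iPad"},
    {"city": "chichester", "network": "Sky", "device": "iPad"},
]


def unique_combinations(records, fields, size):
    """For each combination of `size` fields, list the value tuples held by exactly one record."""
    result = {}
    for combo in combinations(fields, size):
        counts = Counter(tuple(record[f] for f in combo) for record in records)
        singles = [values for values, n in counts.items() if n == 1]
        if singles:
            result[combo] = singles
    return result


fields = ["city", "network", "device"]
single_fields = unique_combinations(visitors, fields, 1)
pairs = unique_combinations(visitors, fields, 2)

print("unique on one field:", single_fields)
for combo, values in pairs.items():
    print(combo, values)
```

In this toy data no single field isolates anyone, but several pairs do, including city "chichester" with network "Sky". The number of combinations grows quickly with the number of fields, so a real dataset would need a smarter search than this exhaustive one.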
There are no guarantees here, just an alignment of indicators. A moderately competent fraudster might take the trouble to ensure that he cycled his IP address and disguised his browser user agent string (easily enough done). But we might still be able to identify him as a repeat visitor by virtue of his location and network provider. In our “chichester” example, all of the following indicators and combinations of indicators point to the same visitor:
- location combined with network
- activity profile combined with device type
- browser fingerprint
- IP address
- locally stored token (cookie)
If we estimate that each of these approaches has (say) a 90% accuracy, and that their errors are independent of one another, then agreement between any 2 of the 5 would leave only a 0.1 × 0.1 = 1 in 100 chance of a false identification, grounds enough to pay attention. For example, in this way you could readily pick up someone opening a duplicate account, whether with malicious intent or innocently, in which case a gentle intervention could guide the customer to his/her previous identity.
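The arithmetic behind that estimate can be checked in a few lines. This sketch treats "accuracy" as the chance that an indicator does not produce a false match, and assumes the indicators fail independently. That assumption is ours, and it is optimistic in practice: an IP address and an IP-derived location, for instance, are plainly correlated.

```python
def false_match_probability(accuracies):
    """Chance that every agreeing indicator is wrong at once, assuming independent errors."""
    probability = 1.0
    for accuracy in accuracies:
        probability *= 1.0 - accuracy
    return probability


# The 90% figure is the illustrative estimate from the text, not a measured value.
two_agree = false_match_probability([0.9, 0.9])
three_agree = false_match_probability([0.9, 0.9, 0.9])

print(f"two indicators agree:   {two_agree:.4f}")
print(f"three indicators agree: {three_agree:.4f}")
```

Under these assumptions, each further agreeing indicator cuts the chance of a false identification by another factor of ten.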
We’ve not yet finished exploring the potential of these approaches: for example, it looks as though we can draw useful inferences by considering the relative strength of the indicators and picking out different patterns in the way visitors’ set-ups are configured (one user in two locations, multiple users behind a single router, etc). Every day we seem to find new avenues to explore.