“R” is for Reflection

R’s popularity has increased substantially in recent years … the R community is noted for its active contributions in terms of packages.

Wikipedia: “R (programming language)”

“R” is for Runaway

“R” is a programming language and environment for data science, and at the time of writing (July 2015) it is thriving. It’s free and it’s open, and those are things that we Revolutionaries appreciate. You might think that data science programming is a somewhat niche area, and indeed it is: but this niche is buzzing. (We’ve been using it to crunch statistics for a product we have been working on, Echo Reflect – see http://www.echoreflect.com.)

Now, “success”, for an Open Source development, is all about take-up. It’s one thing to create great technology; it’s another thing to have it found and valued by others. It’s yet another thing to inspire others to embrace your project and to start contributing to it. In these terms, R is massively successful. The Comprehensive R Archive Network (CRAN) today holds over 6,800 packages that extend the language in general and specialist ways.

But here’s the thing: of these 6,800 extensions to R, more than 5% have been added or updated in the last 2 weeks. Other measures of take-up (surveys, polls, job ads) say the same thing: R is hot right now. And success breeds success: the more activity there is, the more good work will be done, and the more incentive there is to build more good tools. All good, right?

“R” is for Relentless

But coping with so much analytical innovation poses its own problems. It seems that to keep on top of what is out there, you need to be looking at twenty to thirty new packages every day. And, to avoid reinventing the wheel, you need to know what’s already been done: the perfect solution to your problem may be somewhere in those thousands of packages already developed. You know that most of those developments will NOT be relevant to your specific interest, since the majority address niche problems, and you’d really rather not spend time looking at them.

On the other hand, some packages are indispensable (e.g. “ggplot2” for graphing data or “data.table” for efficiently dealing with huge datasets), and have essentially become part of the language themselves. These are the ones we’re after: the packages that are proving themselves in the wild and gaining good reputations.

You’re not alone in trying to deal with this flood: others have gone before and offer the benefit of their experience. For example:

  • CRAN (The Comprehensive R Archive Network) provides a “Task View” which groups packages of special relevance to specific tasks or functions, e.g. “ClinicalTrials”, “Finance” or “MachineLearning”.  Helpful though these are, they are manually curated, and even the specialist curators are clearly not coping with the flood (the “MachineLearning” Task View hasn’t been updated for over 6 months – in any other field this would be impressively recent, but when the technology is refreshing itself at 5% every 2 weeks…)
  • Sites like StackOverflow are extremely helpful in pointing to specific solutions and hinting at the packages in regular use.

Valuable though these human contributions are, one has to wonder if data science itself doesn’t have a role to play here. Wouldn’t it be cool to turn R on itself? Maybe in this way we can surface at least a little automated guidance as to where to be looking. So, maybe there is some mileage in:

“R” is for Reflection

I’m not the first to think along these lines.

But I think one can go further: how about these?

  • Measure how important a package is by how often its download link appears on the Web – in other words, how many Google hits does each package’s download location attract?
  • Measure importance by building a dependency graph: how many other packages build upon this one? (There’s a quick sketch of this just after the list.)
  • Use machine learning techniques to train an algorithm to classify packages.
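
To give a flavour of the dependency-graph idea, here’s a minimal sketch in R itself (only fitting, given the theme). It uses nothing beyond base R plus the bundled tools package, and the only thing you need to supply is a CRAN mirror URL.

```r
# Sketch: rank CRAN packages by how many other packages build upon them.
# Any CRAN mirror will do; this one is just an example.
repo <- "https://cran.r-project.org"
db   <- available.packages(repos = repo)

# For each package, find the packages that Depend on / Import / Link to it
rev_deps <- tools::package_dependencies(
  packages = rownames(db),
  db       = db,
  which    = c("Depends", "Imports", "LinkingTo"),
  reverse  = TRUE
)

# Count the reverse dependencies and list the twenty most built-upon packages
counts <- sort(sapply(rev_deps, length), decreasing = TRUE)
head(counts, 20)
```

I’d expect heavily reused packages such as “ggplot2” to score well on this measure, but that’s exactly the sort of thing the experiments should confirm.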

So, I’m going to have some fun turning a mirror on R itself: I’ll post the results of these experiments as they come through. Stay tuned!

Fingerprints, Footprints and Googlewhacks

In TV crime dramas the plot is often propelled forward by the discovery of DNA evidence, which is “fingerprinted” instantly, leading to the identification of the individual at the scene of the crime. When available, a DNA fingerprint is a very strong indicator of identity, and the drama may then focus, not on who was there, but on why and how the DNA got there. Perhaps the DNA donor played an innocent role at the crime scene, or perhaps the evidence was contaminated while in the hands of the police.  Or perhaps the perpetrator was careful, using gloves and wiping down surfaces, or even faked the evidence, planting DNA to try to frame an innocent party.

Now, when the TV cops don’t have a strong indicator, like DNA or even an actual fingerprint, they may have to rely on combinations of weaker indicators of identity.  The footprint in the rose bed outside the study window, the brand of the cigarette stub in the ashtray, the tyre tracks from the getaway car, the fuzzy image from the security cam.  Put enough of these weak identifiers together and you might just have enough to pinpoint a suspect, and bring them in to help with enquiries: you might even get a confession.

Well, I might be familiar with the cliches, but I’m not about to start writing teleplays for crime dramas.  However, we have been doing a little detective work of our own:

Echo Protect

The Echo Protect project is about detecting potential fraud: spotting suspicious behaviour from visitors to clients’ websites, and from staff using their internal systems.  It’s concerned with patterns of behaviour, and with early detection of suspicious activity. One of the challenges facing the project is the reliable and speedy identification of website visitors and users.  In other words, when someone hits a client’s website, we want to be able to quickly detect if this is a known visitor, and in particular, if there is any cause for concern.  If red flags are raised, Echo Protect can help the client intervene in some way, and deflect the user from any intended fraud, before the fraud has been committed.  So we’re looking for some way to recognise if we’ve seen a visitor before.

Strong Indicators

There are some strong indicators that might be available for us to use, the equivalent of DNA or actual fingerprints.  If the user actually signs in with a user name and password, then that supplies a strong indicator of identity. (But strong is not the same as reliable: for example, the password might have been hacked). There are other strong indicators:

  • A MAC address is a strong identifier of a physical device (but it’s generally not available without user consent/involvement)
  • Any cookies used by the website are strong identifiers of browser/device (but a user can refuse cookies)
  • Browsers can be fingerprinted without using cookies, and provide surprisingly strong indicators of browser/device – up to 94% of browsers can be uniquely fingerprinted.  But these fingerprints are computed using factors that may change over time (e.g. browser version); a toy illustration of the idea follows this list
  • IP addresses are strong identifiers of internet connections (but are generally not long-lasting)
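
As a toy illustration of the fingerprinting idea (not the method used by any particular fingerprinting service), a handful of browser-reported attributes can simply be hashed together. The attribute values below are invented, and I’m assuming the “digest” package is installed:

```r
# Toy browser fingerprint: hash a few browser-reported attributes together.
# The values are invented; real fingerprinting uses many more factors
# (plugins, fonts, canvas rendering, and so on).
library(digest)

attrs <- c(
  user_agent = "Mozilla/5.0 (Windows NT 6.1; rv:38.0) Gecko/20100101 Firefox/38.0",
  language   = "en-GB",
  timezone   = "Europe/London",
  screen     = "1920x1080x24"
)

fingerprint <- digest(paste(attrs, collapse = "|"), algo = "sha1")
fingerprint  # a 40-character hex string identifying this combination

# The weakness noted above: change any one factor (say, a browser upgrade
# bumps the user agent string) and the fingerprint changes completely.
```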

All these are extremely helpful in the right circumstances, but all have their limitations.

BUT: even if we have 100% identification of, say, the device connecting to the website, that’s not the end of the story.  The fraudster might be using multiple devices and the patterns of fraudulent behaviour might be visible only when the activities of multiple devices are correlated.  So it makes sense to rely, not just on one perfect indicator, but on multiple indicators. When we see multiple indicators all pointing in the same direction, the confidence in our identification soars.

We have been taking data from an e-retailer who receives visitors to their website from all around the world (most of whom do not sign in to the website, but merely visit), and we’ve been experimenting with different routes to positive identification. We do have some strong indicators: a locally stored token (equivalent to a cookie), and a browser fingerprint.  We are reasonably sure that when we see these indicators, they point pretty conclusively to a particular browser on a particular machine.  But these strong indicators naturally do NOT help us when one user is using multiple devices, say, a tablet and a laptop.  So we are starting to look at combinations of weaker factors: footprints and tyre tracks rather than fingerprints.

Weak Indicators

For example, let’s consider the property “city” as returned by an IP-based location service. On the face of it, this is a pretty weak identifier.  The website in question gets traffic from all around the world, but most of the traffic is from the UK.  To see a user visiting from London is next to useless in terms of positive identification.  But (to take a real example), what if the city is Belo Horizonte in Brazil? For this website, that is an unusual city. To be sure, Belo Horizonte is a city of 2.5m people and for a Brazilian website, or for a website with a lot more traffic, this would be unexceptional.  But in our case, multiple accesses from Belo Horizonte would at least suggest a tentative identification, something worth noting.  It seems the strength of “city” as an indicator depends on what particular value it carries.
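
One simple way to put a number on that observation (a sketch only; the visit counts below are invented) is to weight each value of “city” by how surprising it is in our own traffic:

```r
# The rarer a city is in our own traffic, the more identifying weight it
# carries. The counts here are invented for the illustration.
visits <- c(London = 52000, Manchester = 8100, Chichester = 35,
            "Belo Horizonte" = 3)

p      <- visits / sum(visits)   # observed frequency of each city
weight <- -log2(p)               # "surprise" in bits: rare value => high weight

round(weight, 1)
#  London 0.2   Manchester 2.9   Chichester 10.7   Belo Horizonte 14.3
```

On this scale London carries almost no identifying weight, while Belo Horizonte carries a great deal, which matches the intuition above.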

We’ve been looking at other “weak” indicators. For example:

  • the network through which a visitor is connected (e.g. BT, Sky, etc);
  • the sort of activity the user gets up to when connected (browsing, buying, home-page only, etc);
  • the type of device being used (iPad, laptop, Android tablet, etc).

Combinations

Sometimes these indicators can prove to be fairly strong in their own right (sorry Google, but “Chromebook” is much rarer than “iPad”, and so is a much stronger indicator). Mostly, though, it is combinations of these indicators that build up a sufficiently strong identification to be useful.

For example, in our dataset, the combination of “city=chichester” and a particular network provider is enough to establish uniqueness.  Neither factor on its own is strong, but taken together they are strong enough to suggest we have identified a unique individual.

You may know about Googlewhacking: the idle pursuit of typing words into Google search and trying to find combinations that yield just one result.  What we’re doing is a bit like that, trying to see which of the data values we hold, and in which combinations, will find the unique visitor.

There are no guarantees here, just an alignment of indicators.  A moderately competent fraudster might take the trouble to ensure that he cycled his IP address and disguised his browser user agent string (easily enough done).  But we might still be able to identify him as a repeat visitor by virtue of his location and network provider.   In our “chichester” example, all of the following combinations of indicators point to the same case.

  • location combined with network
  • activity profile combined with device type
  • browser fingerprint
  • IP address
  • locally stored token (cookie)

If we estimated that each of these approaches had (say) 90% accuracy, then agreement of 2 out of 5 of these indicators would (assuming the indicators err independently) leave only a 1 in 100 chance of a false identification, grounds enough to pay attention. For example, in this way you could readily pick up someone opening a duplicate account, whether with malicious intent or innocently; in the innocent case a gentle intervention could guide the customer to his/her previous identity.
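
The arithmetic behind that 1-in-100 figure is worth making explicit. It’s a back-of-envelope calculation that assumes the indicators err independently, which real indicators never quite do:

```r
# If each indicator is right 90% of the time, and the indicators err
# independently, what is the chance of two of them agreeing on the SAME
# wrong identification?
p_wrong <- 1 - 0.90   # each indicator's chance of a false match

p_wrong^2             # two independent indicators both pointing the wrong way
# [1] 0.01            # roughly a 1-in-100 chance of a false identification
```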

We’ve not yet finished with exploring the potential of these approaches: for example, it’s looking like we can draw useful inferences by considering the relative strength of the indicators and picking out different patterns in the way in which visitors’ set-ups are configured (one user in two locations, multiple users behind a single router, etc). Every day we seem to find new avenues to explore.

A Low Cost, Resilient Production Environment in the Cloud

We reached another milestone last week.  We completed the deployment of a production site for Echo Central.

Revolutionary Systems has been working on the applications for a new venture, Echo Central.   As their strapline says, Echo Central “brings the best of modern customer service to your website”.   It provides customer chat facilities, feedback forms and ways of delivering targeted help material to website visitors.   Website owners who sign up to Echo Central will expect a high standard of service, because the use of Echo Central will reflect upon their own websites.  In particular Echo Central functionality will need to be highly available, reliable and performant (also stylish, intuitive to use, multi-lingual and so on, but for now we’ll focus on the technical aspects).

The Brief

Broadly, the systems for Echo Central need to encompass the following:

  • A retail website for enlisting new customers,
  • A customer dashboard website for those who’ve signed up,
  • User management services to check permissions and privileges,
  • Subscription management services to look after account details and settings,
  • Point solutions for each independent capability of the service (i.e. chat, feedback, help, etc…),
  • An analytics engine (the magic!) to generate added-value insights for customers based on their website visitors’ activity,
  • Databases to support all of the above, and
  • An Enterprise Service Bus to “glue” it all together.

We’re strong supporters of service-oriented architectures and prefer an overall solution where the individual components are deployed independently, communicating efficiently with each other, rather than all running on one big box.  This approach means that to support “all of the above”, we need at least seven servers to start with.  In fact, to achieve a properly resilient system, we need a few more than that.

The Low Cost of Cloud Computing

We, at Revolutionary Systems, are big fans of Amazon Web Services.  They have truly pioneered a revolution in computing.  Our experience of AWS dates back the better part of a decade, and over that time they’ve astounded and delighted us time and again.  Amazon keep on delivering well-conceived cloud services, and on a constantly falling price curve.  Back in 2006 when their Elastic Compute Cloud (EC2) was launched, a “small” virtual server was available for $0.10 per hour: now a “t2.small” virtual server will set you back just $0.028 an hour – 72% cheaper without even allowing for inflation.

Despite long experience with AWS, we’d never before run a production environment on AWS “spot instances”.  Spot-priced instances are AWS virtual servers where the “rental” price is allowed to vary with supply and demand, but on balance is expected to be much cheaper than the “on-demand” instances (where the price is fixed).  Currently EU-based “medium-sized” on-demand instances are charged at $0.077 per hour.  That’s pretty cheap: around £35 a month (assuming 24/7 uptime).  But on a spot-priced basis, the same computing instance costs $0.0101 per hour.  That’s right, just a little more than one US cent for an hour’s use… £4.50 a month (and that price has been very stable recently).  You can see the appeal.
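
For anyone who wants to check those monthly figures, here’s the rough arithmetic. The exchange rate is my assumption (around $1.55 to the pound at the time of writing), not an AWS number, so expect a little rounding slack:

```r
# Rough monthly cost of a "medium" EU instance at the hourly rates above.
hours_per_month <- 24 * 30
usd_per_gbp     <- 1.55   # assumed exchange rate

on_demand_usd <- 0.077  * hours_per_month   # fixed-price ("on-demand") instance
spot_usd      <- 0.0101 * hours_per_month   # recent spot price

round(c(on_demand_gbp = on_demand_usd / usd_per_gbp,
        spot_gbp      = spot_usd      / usd_per_gbp), 2)
# on_demand_gbp      spot_gbp
#         35.77          4.69    -- in line with the £35 and £4.50 quoted above
```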

Building for Resilience

So, what’s the catch?  Well, when you request a spot instance, you specify a maximum price to pay.  If the price spikes up (and it does, occasionally) you will lose the server instantly.  (So, it’s a bad idea to host your database on a spot instance!).  To effectively use spot instances, your applications must be engineered for resilience – to be able to survive a sudden outage.  Services must failover to alternate hardware, preferably in a different location.  Then, as soon as possible, the failed servers must be restored.  And this must happen automatically, without interruption.  In other words your applications need to be engineered for true cloud operation.

But here’s the thing.  A web service like Echo Central’s needs to be engineered for resilience anyway.  Even if we had dedicated hardware that we could walk up to and touch, it could still fail, regardless of what it is or where it sits.  So automatic failure recovery would always have been essential.  The use of spot instances simply means that the failure provisions actually come into play from time to time, and this only increases our confidence in their effectiveness.

We’ve set up the spot instances in two separate and independent availability zones: if for any reason (such as the price spiking) any of them fail, the service will still continue in the second availability zone. And with automatic scaling, the capacity of the system can be rapidly restored.

Likewise, our database is replicated across two availability zones and the two versions are kept in sync.

Going Live

Now Echo Central is going into production trials (restricted beta for the moment), and we will be able to see just how this production environment shakes down.  Volumes in the initial trial stages will be low, of course, but we have designed things to scale seamlessly with volumes, and we’ll be wanting to test that scaling once we are happy that the functionality is right.

All in all, we believe we have engineered a production environment that any large enterprise would be proud to have in place.  And the cost?  Ah, that would be telling, but it is certainly an order of magnitude cheaper than any comparable architecture I’ve seen.  So much so that, for Echo Central, the extremely low operating costs of their systems will be a major factor in allowing them to compete effectively in a very competitive market.

You Got An Ology?

He gets an ology and he says he’s failed… you get an ology you’re a scientist…

(Maureen Lipman as Beattie, in the 1987 British Telecom TV ad, having just been told by her grandson Anthony that he’s flunked his exams, passing only pottery and sociology.)

Old enough to remember that? Relive it here:  Original Ad with Maureen Lipman.

Well, we have an ology too … Objectology.

Some Theoretical Background:

“Ontology” is a term that the Computer Scientists stole from the Philosophers.

Wikipedia advises us that the philosophical definition goes like this:

  • Ontology is the philosophical study of the nature of being, becoming, existence, or reality, as well as the basic categories of being and their relations. Traditionally listed as a part of the major branch of philosophy known as metaphysics, ontology deals with questions concerning what entities exist or can be said to exist, and how such entities can be grouped, related within a hierarchy, and subdivided according to similarities and differences.

While information science defines it thus:

  • An ontology is defined as a formal, explicit specification of a shared conceptualization.  It provides a common vocabulary to denote the types, properties and interrelationships of concepts in a domain.

We at RevSys come from a software development background that places great importance on re-use.  We hate doing the same thing over and over, so we always look for the repeating patterns in what we do. Any halfway decent developer can analyse a business domain and come up with a data model or a class model for it, and build it using conventional tools.  But sometimes these systems are not so easy to change later – we’ve found it better to craft code that is more generic and, as far as is practicable, to treat the specifics of the domain as configuration, which can easily be changed.

So I’d been doing some thinking about what a next-generation generic record-keeping system would look like – what fundamental concepts would need to be supported, and how these might fit together.  In fact, the philosophers’ definition of ontology pretty much summed it up:  “what entities exist or can be said to exist, and how such entities can be grouped, related within a hierarchy, and subdivided according to similarities and differences”.    Since it would be misleading to call this an ontology in the Information Science sense, we named the project Objectology – the science of objects.

Let’s Get Practical

We’re currently building systems for Echo Central, which is a web-based subscription service aimed at providing a suite of tools to help online providers deliver excellent customer experiences.  As part of our service-oriented architecture, we needed a record-keeping system to hold details of subscribers’ accounts and preferences.

We’re moving fast on this, so our initial impulse was just to crank out a specific solution for this relatively simple domain.  But we just couldn’t bring ourselves to do it: we’d already been throwing around ideas for the Objectology engine, so we swallowed hard and decided to make a first stab at Objectology, with just enough features to do the “subscription-manager” job.  Adam took the lead in building it, keeping the design lean and clean, and very powerful.

What did it mean to use Objectology in this context?  It meant that we could model the business data for the Echo Central subscriptions as object templates, written as XML or JSON documents, and not embed the “subscription” concepts in the Java code.  The configuration files are loaded into the Objectology engine and this brings it to life, exposing a web services interface that looks as if it were hand-crafted to support subscriptions, with “inbuilt support” for products, accounts, subscriptions, etc.    In truth, the time taken in doing it this way was probably not much more than the “hard-coded” approach we had at first contemplated, but building Objectology was way more interesting and intellectually stimulating: it was more fun.
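
To give a flavour of what “treating the domain as configuration” looks like, here is a purely hypothetical template, built as an R list and serialised to JSON with the “jsonlite” package. The field names and structure are invented for illustration; this is not the actual Objectology schema, just the general style of thing.

```r
# A purely hypothetical object template, in the spirit described above.
# The field names and structure are invented; they are not the real
# Objectology schema.
library(jsonlite)

subscription_template <- list(
  object     = "subscription",
  properties = list(
    account   = list(type = "reference", target = "account", required = TRUE),
    product   = list(type = "reference", target = "product", required = TRUE),
    startDate = list(type = "date", required = TRUE),
    status    = list(type = "enum",
                     values = c("trial", "active", "suspended", "cancelled"))
  )
)

cat(toJSON(subscription_template, pretty = TRUE, auto_unbox = TRUE))
```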

And how’s that working out?

The model-driven approach has already justified itself.  It’s inevitable in any software development that requirements change, but this is especially true when it’s in support of a business that is itself a startup and whose very business offerings are not yet finalised.  In pulling things together for Echo Central, we’ve changed the way we look at products, accounts, subscriptions and preferences a few times, and each time that has meant a soft change to the Objectology configuration rather than a rewrite of code.

But the real clincher came 2 weeks ago, when we looked at where we had reached, and we decided that Echo Central would need a whole new service: one to keep track of user feedback submissions, a kind of mini-workflow service.  We talked it through and around 4:30pm Adam skyped: “I think we could very easily and quickly get simple management of a feedback item going in objectology” and he, Tim and I reckoned it was worth a go.   The very next day at 6pm, he reported “the workflow thing is working nicely. all for about 100 lines of xml”.   Adam had modelled and implemented the feedback record-keeping system in Objectology in just over 1 day (and it wasn’t even all he had been working on!).

This is a great example of how we like to work at Revolutionary Systems.  We start with well-thought-through principles, and then we embody those principles in practical but general-purpose frameworks which, together with model-based configuration, deliver systems that meet the precise requirement, but in a way that has adaptability and re-use at its core.

And Now For Something … odDball

When, according to habit, I was contemplating the stars in a clear sky,
I noticed a new and unusual star, surpassing the other stars in brilliancy.
There had never before been any star in that place in the sky.

 Tycho Brahe

We often build organisations around what is usual: we streamline processes to optimise normal workflow; we look to standards and norms when measuring performance or growth; we tune ourselves to the expected.  And we define “normal” by way of averages and conventionally accepted standards.  Too often we ignore variability and extremes.

It’s essential, of course, to handle “business as usual” as efficiently as we can, but sometimes what can make the real difference is how businesses handle the unusual.  Information theory tells us that a message in line with expectations contains very little information: by contrast, the unexpected is information-rich. Looking for outliers and anomalies can be very profitable: we might detect and deal with unexpected threats, or we might be able to seize fleeting opportunities.

Take some examples:

At Revolutionary Systems we have been developing services for Echo Central:  services to allow clients to enhance their websites by creating a web presence that is welcoming and which sends out signals that this is a good and safe place to do business.  Echo Central does this by opening up better communication channels between site owners and site users, and this means understanding the browsing customers: who they are, and how they behave.

To help with this, Revolutionary Systems has developed an analysis engine we call “odDball”: technology to look for interesting anomalies.  Anomalies such as:

  • The web page that didn’t render properly, or took a very long time to render.  Perhaps the client browser was the victim of malware, or perhaps the customer just has a very bad internet connection.
  • The website user whose browser fingerprint looks a little unusual.  Perhaps this is a potential customer from an unusual location, or using unusual technology.  Maybe signs of a new market, or maybe your website has just been visited by an automated bot.
  • The customer who keeps asking for repeated quotations with slightly different options each time.  Maybe just someone struggling to choose and in need of help, or maybe someone actively searching for website vulnerabilities.
  • The product line that enjoys a sudden surge of sales. Maybe the start of a new trend, or maybe someone’s fat fingers entered the wrong price into the online catalogue.
  • The sudden deterioration in payment acceptances by your outsourced payment service.  Maybe you’re being visited by a less creditworthy crowd, or maybe your payment service is suffering failures.

As in these examples, picking up early warnings of unusual activity alerts you to threats and opportunities.

How does odDball help?

Our analysis engine odDball is designed to detect all of these kinds of business anomalies: it receives streams of incoming signals (which may come from online activity, system logs, databases, etc), and it categorises and tags them, so identifying the ones that are out of the ordinary, the ones that look odd.

Going further:

  • How should we define what is “odd”?  odDball uses a mixture of machine-learned and hand-crafted rules.  As more and more cases are seen, the rules are sharpened for keener recognition of opportunities and threats.
  • What is “normal” and what is “odd” may be context dependent.   A transaction that is normal in the daytime may look suspicious in the early hours of the morning.
  • Activity that may in itself look quite normal may still create a pattern that is abnormal (for example, signals may arrive in rapid succession, suggesting a non-human origin).
  • Changes in volumes or in the balance of interaction types may also be telling (“we should be selling more of these on a Monday morning”).

So what does odDball actually do?

  • Firstly, it accepts a stream of incoming signals, profiles and tags them according to rules, and in so doing identifies the ones that don’t match known profiles (there’s a toy sketch of this whole pipeline just after the list).
  • It may transform the incoming signals, deriving useful metrics from raw measurements.
  • It sends the signals to one or more analysis rule sets for different purposes.  In the case of Echo Central, this may include:
    • A “base” rule set looks at metrics that are of general applicability: the browser/version and platform the user is using; page sizes and other page metrics; locale and language, etc;
    • One or more extension rule sets belonging to the account holder and perhaps specific down to the web page involved: time to load, page width and depth etc;
    • Rule sets that examine which, if any, JavaScript libraries may have run, or JavaScript errors that may have occurred.
  • It saves the signals for later processing or retrieval.
  • It lets you define bins that hold the results you’re interested in, and retrieve the content of the bins in various ways.
  • It can aggregate signals in various ways, to expose patterns and trends:
    • Signals clustered in time can be aggregated into episodes which themselves can form their own signal stream;
    • Signals summarised over time slices can be analysed for changes in volumes and balance;
    • Signals can be archived for historical comparisons.
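
To make the shape of that pipeline concrete, here is a toy sketch in R. It is emphatically not the odDball implementation, just an illustration of profiling a stream of signals against simple rules, spotting an abnormal pattern, and binning the odd ones; the data and the rules are invented.

```r
# Toy sketch of the profile -> tag -> bin idea described above.
signals <- data.frame(
  ts      = as.POSIXct("2015-07-01 09:00:00") + c(0, 2, 3, 4, 65, 3600),
  browser = c("Chrome", "Chrome", "Chrome", "Chrome", "Firefox", "UnknownBot"),
  city    = c("London", "London", "London", "London", "Chichester", "Belo Horizonte"),
  load_ms = c(420, 380, 395, 30000, 510, 150),
  stringsAsFactors = FALSE
)

# Hand-crafted rules: each returns TRUE for signals that look odd in isolation
rules <- list(
  slow_page   = function(s) s$load_ms > 10000,                      # very slow render
  odd_browser = function(s) !s$browser %in% c("Chrome", "Firefox", "Safari"),
  rare_city   = function(s) !s$city %in% c("London", "Manchester")
)
tags <- sapply(rules, function(rule) rule(signals))

# A pattern rule: signals arriving within seconds of the previous one may
# suggest a non-human origin, even if each looks normal on its own
rapid <- c(FALSE, diff(as.numeric(signals$ts)) < 5)

# Bin the signals that triggered anything, ready for later retrieval
signals$odd <- rowSums(tags) > 0 | rapid
odd_bin     <- signals[signals$odd, ]
odd_bin
```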

odDball plays an important role in the Echo Central services.  But at Revolutionary Systems, we like our software to be versatile: our instinct is always to look for generic solutions to specific problems.  So although odDball is perfectly suited to the analysis requirements of Echo Central, it can be used in many other contexts: fraud detection through analysis of traffic across enterprise systems; monitoring of system performance (in quality, reliability and throughput terms) through log file analyses; and many other possibilities for seizing opportunities and responding rapidly to potential threats.

The initial development of odDball is just reaching completion: it’ll be ready in time for the soft launch of Echo Central later this year.  We’ll also be working on user-friendly tools to specify the odDball rules, and striking ways to visually present the insights it uncovers.

Launching our legal docs repo

One of our main aims at Revolutionary Systems is to create an ecosystem of components, tools, methods and other assets to help reduce the time to market for tech startups.  Today is an important, if small, milestone in that journey, as we officially launch the first element of that vision.  Before you get too excited, let me quickly add that this isn’t some amazing code base, or wonderful tech recipe (they will come, promise 🙂 ).  But it is something that every tech startup wishing to sell an internet service will need…

… legal documents (well, I did say it was a small milestone).

Today, then, we announce the launch of our GitHub repo of legal documents – find them here: github.com/revolutionarysystems/legal-docs .

Right now we have two document templates there (with more to come):

  • A Terms of Service – designed for companies wanting to launch an internet service to which users must sign up.
  • A Privacy Policy – suitable for websites and web SAAS applications, covering personal data, cookies, etc.

Just fill in the blanks with your own details and you should be good to go.  Each document is available as an html file and as raw text.

We are releasing all documents in the repo under a Creative Commons license, allowing anyone to use them and adapt them, for free.  We have been greatly helped in producing these by the groundwork of the folks at Automattic who, for their excellent WordPress service, have put their legal documents under a Creative Commons license.  Many thanks guys.

“Every little helps” they say, and our hope is that this will save tech startups some time and money which otherwise they would have had to spend.  And in doing so, we like to think that we nudge upwards (albeit ever so slightly) the percentage of startups that actually make it to a successful launch.  The world needs better tech.

(Disclaimer – We are not lawyers, and these documents have not been tested in any court cases.  Use of them is entirely at your own risk.)