Welcome to the first of a regular series of posts from the technical team at Growth Intelligence.
We have called this blog ‘A View from Inside the Demand Sphere’ and in it we are going to explore a range of technical issues, across both Data Science and engineering, to give a bit of insight into how we deliver predictive marketing to our clients.
At the heart of Gi is an algorithm that predicts who will become our clients’ next customers. It analyses the historical data in our clients’ CRMs: the list of positive and negative responses they have received to previous outbound marketing activity. However, clients cannot always provide us with the negative responses, as these may never have been recorded.
In that scenario we take a different approach to model building and compare the positive outcomes to the general background (that is, all other companies in the UK). I touched on this approach briefly, in the context of rapid feature generation, in a recent talk at an Elasticsearch Meetup (@elastic_london) hosted by the team behind Red Badger (@redbadgerteam), and I thought I would go into a bit more detail here.
Inside the Demand Sphere™* we have the website of every company. The text of all of these websites is extracted, cleaned up, and indexed into Elasticsearch. This allows us to quickly calculate the relevance of keywords and phrases to the companies in the training set and, by using Elasticsearch’s Term Vectors API, the relevance of those same keywords and phrases to the background (strictly, a single Elasticsearch shard chosen at random).
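For the curious, here is a minimal sketch of how term statistics of that kind can be turned into TF-IDF scores. The `response` dict mimics the shape the Term Vectors API returns when `term_statistics` is enabled, but the field name, terms, and counts are all invented for illustration; this is not Gi's actual pipeline.

```python
# Sketch: turning an Elasticsearch term vectors response into TF-IDF scores.
# With term_statistics=true the API reports, per term, its frequency in the
# document (term_freq) and across the index (doc_freq); field_statistics
# carries the document count. All numbers below are made up.
import math

response = {
    "term_vectors": {
        "website_text": {  # hypothetical field name
            "field_statistics": {"doc_count": 2_000_000},
            "terms": {
                "export": {"term_freq": 4, "doc_freq": 90_000},
                "germany": {"term_freq": 2, "doc_freq": 8_000},
            },
        }
    }
}

def tf_idf_scores(response, field):
    """Classic TF-IDF: log-scaled term frequency times inverse document frequency."""
    stats = response["term_vectors"][field]
    n_docs = stats["field_statistics"]["doc_count"]
    return {
        term: (1 + math.log(t["term_freq"])) * math.log(n_docs / t["doc_freq"])
        for term, t in stats["terms"].items()
    }

scores = tf_idf_scores(response, "website_text")
# "germany" scores higher: similar term_freq, but far rarer across the index.
print(max(scores, key=scores.get))
```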
After visualizing this, you can see that most words and phrases are as relevant to the training set as they are to the background. This is to be expected.
However, if you explore certain topics within the set of keywords, it quickly becomes clear that some of them are much more relevant to the training set than the background. The figure below shows the relative relevance of country names and major economic areas in the training set versus the background.
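That comparison can be sketched with a toy relative-relevance measure. This is a simplified stand-in for the TF-IDF comparison described above, not the measure Gi actually uses, and the function, counts, and smoothing are all my own for illustration:

```python
def relative_relevance(train_df, train_docs, bg_df, bg_docs):
    """Ratio of a keyword's document-frequency proportion in the training
    set to its proportion in the background corpus. A value well above 1
    marks the keyword as a candidate feature."""
    # Add-one smoothing so unseen terms do not divide by zero.
    p_train = (train_df + 1) / (train_docs + 1)
    p_bg = (bg_df + 1) / (bg_docs + 1)
    return p_train / p_bg

# Hypothetical counts: "germany" appears on 60 of 500 training-set
# websites but only on 8,000 of 2,000,000 background websites.
print(round(relative_relevance(60, 500, 8_000, 2_000_000), 1))
```

A keyword near 1 sits on the diagonal of the plot described above; one scoring tens of times higher in the training set is exactly the kind of outlier worth promoting to a model feature.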
One of the themes of our Elasticsearch talk was how we use it to bootstrap models and rapidly explore possible features. The relative relevance of these keywords is a great example of how we can do that with a training set that contains only positive examples.
There is lots more we want to do to take this approach further. We currently use the standard TF-IDF measure to calculate relevance; it would be great to see how something like the BM25 measure compares. We are also looking at exploring other topics, and at generating those topics automatically using something like Latent Dirichlet Allocation (LDA). We look forward to sharing our progress on this approach, and other work, in future posts of ‘A View from Inside the Demand Sphere’.
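As a taste of why that comparison is interesting, here is a hedged sketch contrasting classic TF-IDF with a Lucene-style BM25 weight for the same (invented) term statistics. `k1=1.2` and `b=0.75` are the usual BM25 defaults; nothing here reflects Gi's actual scoring.

```python
# Sketch: classic TF-IDF versus a Lucene-style BM25 term weight.
# All counts and document lengths are made-up illustrations.
import math

def tf_idf(tf, df, n_docs):
    return (1 + math.log(tf)) * math.log(n_docs / df)

def bm25(tf, df, n_docs, dl, avgdl, k1=1.2, b=0.75):
    # Lucene-style BM25: smoothed IDF times a saturating, length-normalised TF.
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return idf * norm

# TF-IDF keeps growing with term frequency, while BM25 saturates,
# so a keyword repeated hundreds of times cannot dominate a score.
for tf in (1, 10, 100):
    print(tf,
          round(tf_idf(tf, 1_000, 2_000_000), 1),
          round(bm25(tf, 1_000, 2_000_000, dl=300, avgdl=300), 1))
```

The saturation is the practical difference: going from 10 to 100 occurrences barely moves the BM25 weight, which tends to make relevance comparisons less sensitive to keyword-stuffed pages.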
*Demand Sphere™ is a network of companies linked by the probability of them entering a commercial relationship.
Alex Mitchell, Data Team Lead (@data_alex)