Betfair is the largest on-line betting exchange in the world and has seen tremendous growth in the past few years. One of the fundamental problems that comes with growth is that the technology that drives the exchange has to scale in line with not just current growth rates but also the forecast growth over next X months/years.
What also complicates the matter is that growth is a function of many variables, including but not limited to number of users, jurisdictions, market offerings and the list goes on. So the goal post is constantly moving for us. Whilst it is a massive challenge, in some ways it is also a nice problem to have. As noted in a previous post, the current exchange back-end architecture, whilst still proving its worth everyday, puts some limitations on rapid scale outs and on our ability to deliver new capabilities to maintain the edge in a highly competitive market. We have been slowly migrating the entire back-end architecture to be event driven, and we recently achieved another milestone in this path. A new stream of core exchange activity, that provides the lowest level of transactional activity ticks that occur in the exchange, in near real time, was deployed live after some painstaking work over a few months. This is a significant step forward for us as it essentially opened up options for us to migrate a lot of our internal service applications to source the exchange activity data from the stream, instead of relying on some legacy sources which were highly reliant on a very heavily used database. More significantly, it also allows us to build similar capabilities exposed to our end users. The detailed benefits of this, can probably be discussed in another entry, but this is no doubt a step change for us.
It wasn’t all smooth sailing, as essentially we are trying to build a single causal stream of all exchange activities, which are being generated by different sources in different formats and are not always in an elegant consumer friendly way.
Part of the complication was also due some of these source systems themselves going through some transition and migrating to a newer platforms, so we constantly had to catch up. Some of the source systems were legacy systems that hadn’t changed for years, which also meant that some of the behaviours were only embedded deep in their code base, rather than being explicitly documented for consumers. Some of the data was also incomplete, so where ever possible we had to go back and fix the source systems, whilst at other places, we had to fill in the gaps, in order to build a completely coherent stream.
So, to ensure successful integration with legacy systems where the requirements/behaviours are not explicit,we integrated early, and integrated often. We developed the project iteratively, where the first aim was to build something that is a minimal representation of the application, which would allow us to integrate end to end. And then we iterated over it, over a few months, incrementally adding new features, while constantly validating the application with external systems. To be able to effectively do this we need an environment that is representative of the real production environment where this will be finally deployed to.
At Betfair, we have multiple levels of integration environments,through which application is promoted, each successive level being progressively more representative of production environment. We also have a dark staging environment, which allows us to deploy into a production like set-up and validate the application with live production data but without actually affecting anything in production, i.e. the outputs being completely isolated from production traffic. This allows us the freedom soak test our services in near production conditions over a period of time, before rolling out to production, thus reducing the risk and providing constant feedback about application’s behaviour in real world.
Kafka is our chosen technology for building some of the key areas of our event stream backbone. One of the key advantages provided by Kafka, among others of course, is the ability to partition and distribute the data for a given topic across a number of partitions where the ordering is guaranteed within a partition. The partitions are also distributed across a number of broker hosts, so they can be produced to and consumed from, with a high degree of parallelism and provides for high overall throughput. As we started the exploratory roll-outs and started measuring the results, it became apparent to us that a monolithic model of message production across the partitions could quickly become obsolete in a few months time, as we see more growth. So we took this a step forward and were able to achieve much higher performance than what was available out of the box. We refactored the message producer to be able produce to multiple partitions independently where each partition effectively had a producer pipeline that was totally isolated from others. Crucially, this meant the set of producer pipelines for a given topic could be either co-located within a single VM or distributed across multiple producer hosts. We run a cluster of producer hosts, using Zookeeper to manage leader elections for each partition. This way we built up the flexibility to be able to run the producers in a monolithic mode for low throughput topics with smaller number of partitions and in a distributed mode for high throughput topics where we had to distribute the data across multiple partitions.
Essentially this means as we see growth with increased user base and increased activity, we can leverage the CPU and memory capacities across a number of hosts as we start to scale out to preserve high throughput and low latency message delivery that is critical to our functioning. This is especially important for us, as we are looking to grow and expand to new jurisdictions and the load profile keeps changing every few months.
Along with this reviewed our data model and pruned areas that were obsolete. We had started off the project with, in hindsight, an overly broad remit trying to model everything in the world and more on the streams. This meant we had ended up with a rather fat model and focussing on delivery we had sidelined pruning the model as requirements changed or dropped off. The learning for us was that it’s always better to start small and augment things as new requirements come into play, while of course ensuring the model is extensible without breaking existing contracts, and this applies to vast majority of use cases. Of course we all knew this but sometimes most obvious things get overlooked when trying to cater for complex scenarios. This brought us some efficiency by reducing the memory footprint of each producer pipeline. We also tuned garbage collection quite extensively to make sure the pauses are kept to a minimum, given the memory churn we go through under normal and peak loads, and up to 20x the load. We also did explore lower level memory tuning using weak reference based caches for very commonly used data, and achieved a degree of efficiency. So this is a worthwhile option too if your application has a relatively large number of long lived objects that are created in disparate areas of code..
We also looked at the data in a different light and achieved another level of efficiency. Our model consisted of mutually exclusive datasets, i.e. data could be sliced such that a subset of data would represent a complete dataset in itself which had a unique set of consumers. So we added the capability to be able to produce each subset independently and as such could be scaled independently without affecting other datasets. This represents a crucial capability for us, as the exchange expands into new geographical markets, we can keep the data isolated and performance tuned for each jurisdiction separately. This allowed us to iteratively roll-out the capability to different jurisdictions quickly and safely. The consumers for each jurisdiction are vastly different and as such have different performance constraints, and it was beneficial to keep them isolated.
Lastly, a note about partitioning. As we know, Kafka caters for near horizontal scalability by allowing seamless addition of new servers into a cluster and crucially expansion of number of partitions within a single topic. Now this presents a challenge to us because we generally distribute the data across the partitions based on the hash of a simple key and the moment we increase number of partitions we will start leaking partial data into new partitions. We worked around this by isolating the message partitioning logic into a set of pluggable policies, that follow a time based activation. Meaning only data that comes into existence after a set point in time can be allowed to enter new partitions, while all existing data is sticky on existing partitions. This allows us to essentially scale without affecting any of existing consumers. Its also essential to ensure that data does not move across partitions. Without going into the details of our data model, and the transactional nature of the data, the message keys tended to change under certain valid circumstances and we took additional measures to ensure the keys stayed static even though the backing data actually changes over time. This again helped us to provide a consistent view of the data on the stream.
On a busy day, we process anywhere up to a 120 million transactions. How do we ensure we don’t miss any of it? The consequences of missing even one of those is rather huge for us as critical reporting and risk calculations are based on this data and they simply cannot go wrong. We have multiple levels of testing with different levels of coverage at each level. Still does that give us the confidence to say we haven’t missed anything? Nearly but probably not. So we devised a system to reconcile the data we generate with the transactional data that ends up in our Data Warehouse for long term storage and highlight any discrepancies by literally looked for needles in a haystack. We run this even today to highlight any discrepancies in the output and have managed to progressively reduce the differences to nothing or only known values. When we are dealing with such large critical datasets its important to establish an automated feedback loop to validate the data to be absolutely confident of the quality of output.
All of this could be classed as the first generation of our streaming architecture and we will now be focussing on building new capabilities on the back of these streams. It will certainly be interesting to review this in a years time and see how much all this effort paid off for us in the near term, when the field would have again invariably changed and we would yet again be fighting a different battle to up-scale the exchange. There is a lot of really great work that is going on in Betfair alongside these, including a new in house journalling framework, and several core exchange components being rebuilt on this platform, which will take the exchange into the next generation. Watch this space for more details on those initiatives in the next few months.
/Manjunath Shivakumar is a Principal Developer in our Exchange team.