A decade of technical promise and open-source fall-outs
Ten years ago this month, when Lehman Brothers was still just about in business and the term NoSQL wasn’t even widely known, let alone an irritant, Facebook engineers open-sourced a distributed database system named Cassandra.
Back then, the idea that huge numbers of companies would need a scalable database was almost laughable – and the grip of traditional relational database systems is reflected in the mythical moniker given to what would become one of the first of many databases designed to run on a cluster of machines.
Named after the Greek figure who was cursed to utter the truth but was never believed, Cassandra might seem an odd choice for a system whose raison d’être is believability – but it delivered a nice dig at the stalwarts of the RDBMS world… and their trust in a false Oracle.
Today, Cassandra – now under the umbrella of the Apache Software Foundation (ASF) – is regularly ranked in DB-Engines’ top 10 and is used by big-name firms like Uber, Twitter and Netflix.
After being driven by Cassandra-based biz Datastax for the majority of its lifetime, the project recently reached a turning point after a falling out between the firm and the ASF.
Now, the project is readjusting to life without a single vendor driving it forward, facing new competition and adapting to a rapidly changing tech landscape.
Casting back to 2008, Facebook engineers Avinash Lakshman and Prashant Malik were searching for a way to solve the inbox search problem, which meant storing reverse indices of all the Facebook messages sent and received by users.
“The amount of data to be stored, the rate of growth of the data and the requirement to serve it within strict SLAs made it very apparent that a new storage solution was absolutely essential,” Lakshman wrote at the time.
“The solution needed to scale incrementally and in a cost effective fashion. Traditional data storage solutions just wouldn’t fit the bill.”
The goal was to develop a scalable, high performance, high availability database system – and the first deployment of Cassandra within Facebook was for the inbox search system storing terabytes of indexes across a cluster of more than 600 cores and 120 TB of disk space.
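For readers unfamiliar with the term, a reverse (or inverted) index maps each word to the messages that contain it, so a search does not have to scan every message. The following is a minimal single-machine sketch of that idea, not Facebook's implementation; the function names and the sample messages are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict


def build_inverted_index(messages):
    """Map each lowercase term to the set of message ids that contain it."""
    index = defaultdict(set)
    for message_id, text in messages.items():
        for term in text.lower().split():
            index[term].add(message_id)
    return index


def search(index, query):
    """Return the ids of messages containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the first term's matches and intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample inbox: message id -> message text.
messages = {
    1: "lunch on friday",
    2: "friday deploy plan",
    3: "lunch menu",
}
index = build_inverted_index(messages)
matches = search(index, "friday lunch")
```

The hard part Lakshman describes is not this lookup but keeping such an index available and growing across many machines, which is what the rest of the system was built to do.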
At the same time, Jonathan Ellis – who would go on to co-found Datastax – was evaluating scalable database technologies for his then employer Rackspace, to tackle issues with scalable storage. After rejecting HBase, CouchDB and MongoDB, he hit upon Cassandra, working on it for about 18 months before forming Datastax.
“It occurred to me, as application development was moving to this cloud application world, this was a problem everyone would run into as they needed to scale to their needs,” he told The Reg. “It wasn’t just going to be an exception that the eBays, Facebooks were going to face – it was going to start affecting mainstream development.”
However, he said that not everyone agreed. “When we started raising money for Datastax, the most common pushback we got from venture capitalists was, ‘There’s five companies in the world that are going to need a scalable database, and Google already has one, Amazon has one, so who’s your market going to be?’ I think the passage of time has vindicated [our] vision.”
Having worked on Cassandra before it was brought into the ASF, Ellis was in a good position to request he be made a committer when it was; a year later he became the first project chair, a role he held until 2016.
Back then, there wasn’t much of a Cassandra community – something Ellis puts down to Facebook’s reasons for releasing the technology. “They weren’t looking to be a database vendor. It’s a valid way to open-source but there wasn’t much of a community.”
One of Datastax’s early hires was Patrick McFadin, who quickly settled into the role of community builder, and over the next few years numbers grew and success beckoned; Cassandra was effectively re-written from the 1.0 version and the project can point to a number of technical highs.
“Early on we built a fantastic community of people who were interested in the technology and were using it to solve challenging problems,” said Aaron Morton, CEO of Cassandra consultancy The Last Pickle, who got involved in the project at about version 0.3. “The community have always been deeply invested in the technology.”
At the outset, the group spent a lot of energy “explaining the compromises and advantages of distributed databases, to get people used to the idea that they don’t always need atomic transactions or have to store data in Third Normal Form,” Morton said.
So committed was this group of initial adopters, Morton said, that they needed some convincing to get behind the changes brought in with the creation of the Cassandra Query Language (CQL) in version 1.2.
CQL is widely pointed to as the highest point in the technology’s decade. Andrew Cobley, a senior lecturer at the University of Dundee – who discovered Cassandra while trying to decide which NoSQL database to teach to students – describes it as a “game changer”.
“It was a really welcome move that made it so much easier to do the programming,” he said. “You still had to design your databases and your tables to be efficient, but you didn’t then have to struggle with this completely arcane way of trying to query it. If you understood SQL, you cut it down – you had to understand the rules of Cassandra, but once you’d done that, interfacing just felt like with a SQL database.”
Another highlight, Cobley said, was the introduction of virtual nodes (vnodes), to simplify management of clusters, while Datastax’s Ellis pointed to the implementation of lightweight transactions using a Paxos consensus model, which he reckoned was the first production-ready open-source implementation of Paxos.
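To give a flavour of what virtual nodes do, here is a minimal sketch of a consistent-hash ring in which each physical node owns several token positions rather than one. It is an illustration under stated assumptions, not Cassandra's code: the `Ring` class, the `token` helper, the MD5-based hashing and the node names are all hypothetical, and Cassandra's real partitioner and token allocation differ.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Hash a string to a 64-bit position on the ring (MD5 is an illustrative choice)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class Ring:
    """Toy hash ring where every node is placed at several virtual positions."""

    def __init__(self, nodes, vnodes_per_node=8):
        # Each node contributes vnodes_per_node tokens, spreading it around the ring.
        self._ring = sorted(
            (token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self._tokens = [t for t, _ in self._ring]

    def owner(self, key: str) -> str:
        """Return the node owning the first token after the key's hash, wrapping around."""
        idx = bisect.bisect_right(self._tokens, token(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"user-{n}" for n in range(200)]
before = Ring(["node-a", "node-b", "node-c"])
after = Ring(["node-a", "node-b", "node-c", "node-d"])

# Keys whose owner changed when the hypothetical node-d joined the cluster.
moved = [k for k in keys if before.owner(k) != after.owner(k)]
```

In this sketch, every key that changes owner after `node-d` joins moves to `node-d` itself, and because the new node's tokens are scattered around the ring it takes small slices from several existing nodes instead of one large range from a single neighbour. That spreading of work is the management simplification Cobley is pointing to.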
However, with the smooth comes the rough. Cobley noted that there have been some “minor things that haven’t worked quite so well as they should have” – but Ellis argued that most of the technical problems were “fairly tractable” by getting the right people in the room.
Where things start to get sticky is with the non-technical issues, and for Apache Cassandra the stickiest has to be the 2016 rift between Datastax and the foundation.
At the heart of the spat was something not uncommon for the open-source world: the question of how much control a single vendor should have over the direction a project goes in, and when the foundation should get involved.
This is a fine line to tread, and one that Datastax appears to have over-stepped more than once, whether by intention or error.
As one person close to the project, who asked not to be named, said: “There were one too many accusations of strong-arming the project for the ASF board to not take some sort of action.”
For his part, Ellis said the ASF board of directors felt his firm was “monopolising the community, and that – even if Datastax wasn’t doing anything nefarious – there was a potential for that in having the founder of Datastax as the PMC [Cassandra Project Management Committee] chair”.
Ellis is fairly frank about the political challenges involved. “I wasn’t completely blind to the tension there… [and] was just crossing my fingers that I could stay on the right side of the line. And with mixed success, I guess.”
In any event, he said that after seven or so years “it was time to get some new blood and more diversity in the project,” adding: “No hard feelings.”
In theory, the departure of Datastax could leave the door open for another vendor to step in and lead the way, but observers told The Register that it doesn’t seem likely.
“By design and necessity, Cassandra is complex enough that you can’t just ‘bring someone onboard’ to offer support in any meaningful way,” our source said. “It takes a new developer, even a talented one, over a year to come up to speed on the major components of the system. Someone wanting to enter this market would therefore have to buy their way in by peeling existing talent out of the community.