InfluxDB
When developers are adopting something to build their product with, they want to do it on open source tools.
Episode Overview
In this episode of The Craft of Open Source, I spoke with Paul Dix, Co-Founder & CTO of InfluxDB. We had a great conversation about the founding of their project, why he decided to build a time series database at a time when that wasn't an obvious thing to do, and the struggles of scaling a business and keeping the open source community happy at the same time. If you haven't checked out Influx already, you should. We use their product at Flagsmith and were really excited to learn more about the team. My favorite quote from the interview came when Paul was talking about a licensing change they made:
"We were shredded. People accused us of doing a bait and switch and all this other stuff. I understand why people were upset and pissed off. At the same time, I do not have a choice. If there was some other way I could think of to do this, I would take that path. In my mind, there were two paths available. One is close the company down and there's no open source project anymore unless somebody wants to go and take it on their own or do this and keep going. The interesting thing is while people were upset on Hacker News, we immediately had people contacting us asking us how they could buy the commercial version. Is it available? When is it going to be available? How much does it cost? All of these things are great things you want to hear if you're trying to create a business."
I really enjoyed my time with Paul and I hope you enjoy the episode.
-Ben
Episode Transcript
Ben Rometsch: Thanks for your time.
Paul Dix: No problem.
{{divider}}
We wanted to talk to people who have commercial open source companies and open source projects from different directions. I thought that your story and the InfluxDB story were interesting, which is why we reached out. Would you give us a little bit of background about yourself and the business to start off?
I'm a developer. I started the company with my cofounder back in 2012. We started as a company called Errplane. The idea then was that it was going to be a SaaS product for real-time metrics and monitoring. I have a bit of background in machine learning and artificial intelligence. My idea then was that we could build something where the interesting bits would do anomaly detection and apply machine learning techniques to monitoring data. We applied to Y Combinator, got into the Winter '13 batch and started building that product. To get to the machine learning bits, we had to build all this infrastructure to get the data, store it, query it and all that other stuff.
Coming out of that in 2013, the product itself wasn't taking off. We'd raised a small amount of money. It was me, my cofounder and one other guy. I thought that the infrastructure that we had built was interesting, but the only way anybody was going to use it was if it was open source. Developers, if they're adopting something to build their product with, they want to do it on open source tools. They can have a little bit more control over what's going on. In the fall of 2013, I had this idea where I was like, “The product isn't going well, but there's something here. Let's spike on creating this as a new open source project.”
We used the exact same technologies that we'd used to build the API, which was Go as the programming language and LevelDB as the underlying storage engine, and built a time series database. The thing is in 2013, nobody was focused on time series at all. Graphite was the most popular project in the space and it hadn't had a release in over a year. It was a completely orphaned project, but it had a significant community, which was interesting.
{{divider}}
Were there commercial offerings at the time? I'm trying to remember a few years back.
At that time in time series, there were a couple of commercial offerings in the financial market data space. One was a solution called OneTick made by a company called OneMarketData. Another is a solution called kdb made by Kx Systems. Both still exist and still sell their products. As for OneTick, in 2010, I had worked for a FinTech startup and we had used OneTick as the initial starting point for what we were building. We ended up having to switch over to something that I wrote from scratch. We had Scala web services on top of Cassandra as the long-term store and Redis as a real-time indexing system. The thing about OneTick and kdb was that those products, when they were built, were largely designed for a small number of time series that move at high velocity.
Think of high-frequency trading where you have bids and quotes in a market and it's happening hundreds of thousands of times a second. Whereas if you look at sensor data, server monitoring data, and a lot of other time series data, what you have is hundreds of thousands or millions of unique time series, or tens of millions or even billions of unique time series if you're Facebook, Netflix, Google or whoever, but moving at a much slower rate. You have a lot more data and many different ways to slice and dice it. At least for OneTick, it wasn't designed to do that. Commercially, I don't think anybody was doing anything. MongoDB and DataStax, the company behind Cassandra, both have pages about time series data and stuff like that at this point. Back then, I don't even know if anybody was calling out time series as a specific category.
{{divider}}
If there had been 3 or 4 popular open source projects out there, do you think you would've done the same thing?
I don't think I would have. The reason I did it was that over the summer of 2013, we had maybe twenty paying customers for our product. I had talked to them and asked them why they were even paying us because there were a ton of other options out there at that time. The thing that I thought was interesting was that a couple of the customers who were most enthusiastic about what we were doing were using our product as a time series solution. They were building an application on top of our API. In this case, they were doing it for internal purposes, but they were still building on top of our API and using our generic dashboarding solution. The other thing was I went to a conference called Monitorama in Berlin in September of 2013.
It's a conference all about monitoring. I thought, “This will be a good place to try and find some additional customers.” Instead, half the attendees were people building monitoring companies. People from Stackdriver, Server Density, New Relic, Datadog, Circonus, all those people were there. The other half were people at large companies who were trying to roll their own stack. Most of them were using Graphite. Graphite was not designed for huge scale. It's built as a round-robin database, which means it's only built for what are called regular time series, where you have samples at fixed intervals of time. You’re saying like, “I'm going to take a sample once a minute or whatever.”
My view of time series is that's only one example. That's the metrics case, but event-driven data is also interesting. I view logs, exceptions, traces, individual requests to an API, all of that as time series data as well. Graphite wasn't designed to do that. Even for the regular time series data, Graphite would quickly fall down if you had hundreds of thousands of time series. Scaling was difficult. Nobody was focused on the space. Nobody in open source was doing anything specifically for it. In 2013, everybody was thinking about NoSQL databases. The king of NoSQL databases at that point was MongoDB, and it still is. When I started InfluxDB as a project, I had seen what happened in the NoSQL space.
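The difference between regular and irregular series can be sketched in a few lines of Python. This is only an illustration, not how Graphite or InfluxDB actually store data; `RoundRobinSeries` and `EventSeries` are hypothetical names made up for this sketch. The round-robin store keeps one slot per fixed interval, so several events inside the same minute collapse into a single value, while the event store keeps every point with its own timestamp.

```python
class RoundRobinSeries:
    """Hypothetical fixed-size ring of slots, one slot per interval (regular series)."""

    def __init__(self, interval_s: int, slots: int) -> None:
        self.interval_s = interval_s
        self.values = [None] * slots

    def write(self, timestamp_s: int, value: float) -> None:
        # Round the timestamp down to its interval and map it to a slot.
        # A later write in the same interval overwrites the earlier one.
        index = (timestamp_s // self.interval_s) % len(self.values)
        self.values[index] = value


class EventSeries:
    """Hypothetical append-only list of (timestamp, value) points (irregular series)."""

    def __init__(self) -> None:
        self.points = []

    def write(self, timestamp_s: int, value: float) -> None:
        # Every point is kept, whatever its spacing in time.
        self.points.append((timestamp_s, value))


rr = RoundRobinSeries(interval_s=60, slots=5)
ev = EventSeries()

# Three events land in the first minute, one in the second.
for ts, v in [(0, 1.0), (10, 2.0), (45, 3.0), (60, 4.0)]:
    rr.write(ts, v)
    ev.write(ts, v)

print(rr.values)       # one value survives per one-minute slot
print(len(ev.points))  # all four events are retained
```

The fixed-slot layout is cheap and predictable for metrics sampled once a minute, but it has no place to put logs, exceptions or individual API requests that arrive at arbitrary times.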
I'd been paying attention to how the NoSQL market developed from the mid to late aughts on. I remember in 2008, I interviewed at 10gen, which is the company behind MongoDB. At this point, they were still called 10gen and they were doing stuff in addition to MongoDB. At that time in 2008, I thought, “If somebody is going to win the NoSQL market, it's probably going to be Riak or maybe Cassandra.” I did not think it would be MongoDB.
I was viewing it only from a technology angle, and Riak seemed to be the superior technology in the space. 10gen offered me a job at that point. The other thing they were doing was server-side JavaScript. I was like, “There’s no way server-side JavaScript is going to take off,” and I was wrong on that one too. Node.js started within a year or two after that. What I took from watching MongoDB mature over the five years since then was that the thing Mongo did well early on was they nailed the developer experience.
They made it super easy for a developer to pick up their database and build an application quickly. If you make developers more productive, they'll adopt your tech even if there are some holes in it or there are some rough spots or whatever, as long as you help them get stuff done faster. When I first created InfluxDB, that was my idea: we need to optimize for the developer experience and make it super easy to use, so people can build a time series application on top of it as quickly as possible. I gave a couple of talks in early November of 2013. You were asking me, “Were other people doing stuff in that space?” They weren't. In fact, after one of the talks I gave, the CTO of Parse.ly came up to me. He was like, “I'm glad somebody finally focused on time series in an open source database. It seems open source time series databases are a ghetto. Nobody is doing anything in them.”
{{divider}}
Why do you think that was? We felt a similar thing when we started Flagsmith. I was running an agency at the time. We were expecting that there would be 3 or 4 projects on GitHub that we'd be able to choose the one that fit for us. There were a couple. Do you think people are scared about building a database from scratch? I wouldn't try it.
It depends on the scale that you're at. If you look at every single large tech company, they've created their own time series database from scratch, most of them multiple times. Google has done it multiple times. Facebook's done it multiple times. Netflix has done it four times. Amazon has multiple internal ones and, at this point, multiple external-facing ones in AWS. It’s the same thing with Azure. Every single one of them has created a time series database. If you look at the monitoring and analytics providers like Datadog, New Relic, all those people, most of them started with an open source technology that they gradually scaled up. With New Relic, it was MySQL. Datadog in the early days, they were on HBase or something like that. Every single one of them graduated through various levels of doing this and then created something from scratch. The problem there is that the monitoring and analytics companies all view the database as their core IP. They're not going to open source it, which means anybody else who wants to build that thing has to build it from scratch.
{{divider}}
That was a leap of faith that you took when you decided to put aside the SaaS product and open source the crown jewels thing?
It was a leap of faith. In 2013, Docker was taking the world by storm and was raising large rounds of funding based only on the popularity of the project. MongoDB that fall had raised some absurd round of funding at a $1.2 billion valuation. In my mind, I was like, “Classic four-step process: open source database, get it massively popular, question mark, profits.” My thesis was if I open source this thing and it gets popular enough, there will be some way to make money off of it down the road. The thing is the only way developers are going to use it is if it's open source.
{{divider}}
We felt the same thing. Were you quite conscious of the idea of it being in that order as in points 2 and 3 being like you were going to raise money and you weren't going to plan on generating revenue from that point on? It was a matter of getting some velocity behind the open source project itself.
All of that makes it sound like I had much more of a master plan than was the reality at that stage. I remember before we started, when I got back from the monitoring conference, I was like, “There's something here if we do the open source thing.” I told my cofounder and the other guy who was working with us, “We have eleven months’ worth of money in the bank before we run out. This product that we're building is not going to get us to a point of profitability where we can pay ourselves salaries. It's also not going to get enough traction to warrant another round of funding. If we don't do something and keep doing the same thing we're doing, in eleven months we're out of money and we have to get real jobs.”
Originally, my idea was like, “We'll do the open source thing and we'll make this the basis of our product. We'll use the demand behind the popularity of the open source to drive adoption of our actual product that we're building on top of it.” What happened was I gave these talks in November. I continued giving some talks and writing blog posts. Over the course of the next three months, it completely took off, or at least, in my mind, it completely took off. It was obvious people were super excited about it. They started using it even though we told them like, “This is total prototype software. Do not trust your data with it.”
{{divider}}
That's a good sign of validation?
It took off so quickly that I was like, “We shouldn't even bother with the SaaS product anymore.” In January of 2014, we announced that we were going to be sunsetting that product. We told people we'd help them get moved over to a competing solution or whatever. We were like, “Let's focus on the open source thing and see where that takes us.” If we got it more popular, at the very least, I should be able to raise a little bit more funding, which would give us more time to figure things out. There was no master plan other than, “There's something here, people like it. Let's see what we can do to keep it going for however long we can.”
{{divider}}
When you made that decision, was that technically easy to open source? Was there a bunch of work you needed to do to check all the code, libraries, choose license and things like that? Was it a rapid process?
It was rapid. It’s not like we took our existing code and open-sourced it. We had the backend API of the monitoring system that we had built. Originally, I had written that thing using Scala, Cassandra and Redis. I had rewritten it using Go as this individual binary, but still with this REST-based API. I was like, “If we're going to do this as an open source project, how you package it up and make it available to people is important. Let's start it as a greenfield project, brand new code. We'll build the exact same thing.” When you're building the exact same thing a second time, one, you go much faster. Two, the code is cleaner because you have a better idea of how to organize it. The API is cleaner. The one thing that we changed from that old implementation was that we needed to give this thing a query language to make it easier to use. In my interactions with the customers that we had, I realized that the REST API took a little bit more explaining than I would have preferred. The thing we did was we said, “We'll have a query language that looks like SQL. It won't be SQL exactly because we're writing our own parser, executor and everything.”
That was the one addition. We worked on the code for five weeks, put together a basic documentation website and open sourced it. The only thing we checked was libraries and stuff like that, making sure to use only MIT- or Apache-licensed libraries. InfluxDB is MIT-licensed. We wanted permissive licensing throughout the entire project. We weren't going to adopt code that was copyleft-licensed and stuff like that. I'm more than happy to talk at length about open source licensing because I feel strongly about that one too.
{{divider}}
It's an interesting subject area that doesn't seem to get much thought or attention. When we were choosing a license, there wasn't a huge amount of literature or guidance on that side. I was curious to know, has that license changed at all since you originally did those first commits?
We've had a number of changes in licensing over the years. From those early days, we were developing everything out in the open, which is both good and bad. Normally with database projects, people work behind closed doors for years, without any attention, until they open it up. We were developing everything out in the open. The goal was going to be an open source distributed time series database with no external dependencies. That was the tagline for it. I was putting that in contrast with what was popular for analytics tasks at the time, which was Hadoop or HBase or whatever. These require multiple things to be running, multiple different projects working together. It's a pain to set up.
The goal was it's one thing, you install it, and you're good to go. Fast forward: in 2014, I spent my time evangelizing, giving talks and also fundraising. I managed to close the Series A round of funding, which was $8.1 million. That gave us enough money to add more people to the database team but also start building out the other parts of the open source stack. That was the TICK stack: Telegraf for collecting data, InfluxDB for storage and query, Chronograf for visualization and user interface, and then Kapacitor for processing, real-time monitoring and ETL.
In 2014, 2015, building this stuff, still not focused on the revenue at all. The beginning of 2016, I realized we're going to have to go out and raise a Series B round of funding. We still don't have any meaningful revenue to speak of. Normally to get to a Series B round of funding, you need to have some significant numbers to say like, “Here's how much it costs to acquire a customer. Here's the lifetime value of the customer. Here's all this other stuff.” We had none of that because we weren't making any money. The thing is at the beginning of 2016, there was also this hiccup in the fundraising markets where for some reason, investors got worried and skittish that things were slowing down or whatever.
The winter of 2016 up until maybe the fall of 2016 was a horrible time to try and raise money. We had to. I was like, “We need to have some story for how we make money on this thing.” In the summer of 2015, we had told people, “We’ll do professional services and support.” We signed up one customer over a 7 or 8-month period. That wasn't working in terms of commercialization. What I realized was that one thing we could commercialize was high availability and scale-out clustering.
{{divider}}
When you say commercialize, do you mean make it closed source?
My thesis was you had to have some closed source code to be able to make money at all. Red Hat seemed to be the only company in the world that was able to make money with a completely open source code base. I don't think another Red Hat could exist. I still believe that now. I don't think there's going to be another Red Hat. I don't think that's going to happen. The thing is to put everything in context. At that time, there were maybe about 10,000 servers running. We had this feature in InfluxDB where once every 24 hours, it would ping a server of ours to let us know that it's running. We could get some idea of what versions were running. At that point, 10,000 servers would report in a day.
{{divider}}
Was that the main metric that you were looking at when you were trying to get an idea about how things were going to the investors and stuff?
I looked at that and GitHub stars. I looked at how many times people were mentioning InfluxDB on Twitter, whether people were writing blog posts about it, whether people were giving talks about it. At that time, our community communication was all through a Google group, so how many posts were going on there? I was trying to get a signal for how interesting or vibrant the project is. As you get farther along, you can look at things like Google Trends in terms of people searching for your search term. If you're going to pick an open source project name and it's the primary one, it's good to have a name that's unique, one that's easy to find on Twitter search.
The other one is questions on Stack Overflow. You can use these things as external signals to see how vibrant an open source community is. People early on would complain about how many issues there are in GitHub. A lot of people will say, “How do I judge the health of an open source project?” They'll go to GitHub and see the issue count, which is the wrong way to judge it because the most popular projects are guaranteed to have the most issues. Any popular open source project that you look at is going to have over 1,000 issues open at least.
{{divider}}
We've gone through the same process. We're about to build out that telemetry. At the moment, we’re completely blind to how many people are running our platform behind closed doors, especially when you're small and have little other than GitHub stars. We're tracking issue count as well. They're not necessarily bugs. People might be asking for ideas on best practice and all that stuff. I love it when I get an issue come in. It's someone that you can talk to, but we found it difficult to try and get a handle on the health of an open source project. You've got these vague data points that are emitting data. It's completely different to a commercial business.
We added the phone home functionality to the project in June of 2014. We were public about the fact that we were adding it. We made it an opt-out thing. There were a couple of people who got angry about that. People don't want that. I understand that. At some point, we need to be able to gather a little bit of information to help us sell the project to people. Also, when we talk to prospects, people who would become customers, part of the thing they want to buy into is a healthy open source project. You need to have these metrics handy to show people like, “Yes, there's activity here. It's a project that you can build on top of.” Having a phone home feature is one of those things that helps you gather that information. I know some people get upset about it, but the truth is our phone home feature has always been a thing where there's a configuration setting. If you don't like it, turn it off.
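As a rough illustration of what an opt-out setting means in practice, here is a minimal Python sketch. It is not InfluxDB's code: the function names and the `reporting_disabled` and `reporting_enabled` keys are assumptions made up for this example, and the payload fields are placeholders. The only point is the default: with opt-out, a server that never touches its configuration still reports; with opt-in, it does not.

```python
def should_report_opt_out(config: dict) -> bool:
    # Opt-out: reporting is on unless the operator explicitly disables it.
    return not config.get("reporting_disabled", False)


def should_report_opt_in(config: dict) -> bool:
    # Opt-in: reporting is off unless the operator explicitly enables it.
    return config.get("reporting_enabled", False)


def build_report(server_id: str, version: str) -> dict:
    # Hypothetical minimal payload: just enough to count servers and versions.
    return {"server_id": server_id, "version": version}


untouched_config = {}                           # operator never edited the file
disabled_config = {"reporting_disabled": True}  # operator turned reporting off

print(should_report_opt_out(untouched_config))  # reports by default
print(should_report_opt_out(disabled_config))   # respects the setting
print(should_report_opt_in(untouched_config))   # silent by default
```

Under the opt-in variant, the untouched configuration stays silent, which is the case where the project ends up with almost no data.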
{{divider}}
We had a big argument about whether it should be opt-in or opt-out.
If you make it opt-in, nobody is going to set their configuration to opt in. The only way opt-in works is if you have a prompt where they can say yes or no. You're still going to get a lot of people opting out.
{{divider}}
Trying to think back to 2013. When we open sourced our project and started selling it, there was such a well-trodden path from people like yourselves and companies that had grown super-fast like GitLab. Back in 2013, did you feel that change in terms of raising money? If you were in the position that you were in a few years ago but now, do you think the conversations would be completely different?
They would be completely different. In 2013, it should have been obvious in retrospect, but it was not obvious then to the investment community or to the open source community how transformative cloud and the cloud vendors, AWS, Azure and Google, would be to the open source software infrastructure market. It completely changed everything. That is behind all the changes that have happened in licensing over the years. Back to the beginning of 2016, when we had these 10,000 servers calling in, I was like, “We're going to have to make something closed source so that people will buy it.” I didn't like the idea of becoming an open core company. I don't like open core software. I like open software and then commercial software, but I don't like the two to be blended. But we weren't even going to have a company in eight months unless we figured something out. Of those 10,000 servers, all but 90 were individual servers. They weren't clustered, which isn't surprising.
We had been developing clustering out in the open, but at this stage, we told people this is not ready to use. This is alpha software, don't use it. All the popularity we had up to this point was based entirely on what you could do with a single server of InfluxDB. I thought, “What that means is there's something valuable there.” People are getting value out of it and they're using it. They're telling their friends about it. If we go and we say, “We're going to make high availability and scale-out clustering closed source and commercial,” that sucks. At the same time, there's still something valuable that we're producing that people will hopefully continue to use. We made that announcement. I had a blog post in early March 2016 where I said future versions of InfluxDB will be single-server only. We will have a commercial version. It got to the front page of Hacker News.
{{divider}}
What happened to that?
We were shredded. People accused us of doing a bait and switch and all this other stuff. I understand why people were upset and pissed off. At the same time, I do not have a choice. If there was some other way I could think of to do this, I would take that path. In my mind, there were two paths available. One is close the company down and there's no open source project anymore unless somebody wants to go and take it on their own or do this and keep going. The interesting thing is while people were upset on Hacker News, we immediately had people contacting us asking us how they could buy the commercial version. Is it available? When is it going to be available? How much does it cost? All of these things are great things you want to hear if you're trying to create a business.
{{divider}}
Are you saying you hadn't taken anything away at that point? Were people using the open source alpha clustering stuff?
There were a few people using it. The truth is we didn't take it out right away. We created the next release, which still had that code in it. I told people ahead of time, the next release will be the last release with this code in it. The release after that is not going to have it. Nobody wants to use a dead-end thing that's not being developed anymore. There were some people using it, not that many. A small portion of the user base that we could see. It wasn't about that. I don't think anybody who complained on that Hacker News thread had been using the clustering. There was the expectation that it was going to be available. The plan was always that InfluxDB was going to be an open source distributed time series database. That was the plan. That's what I wanted to do. That's still what I want to do, which we'll come back to.
{{divider}}
Has that clustering code ever seen the light of day or is it all completely proprietary?
It's completely proprietary.
{{divider}}
It's not sitting publicly on GitHub under a license where you can't use it without a paid subscription or anything like that?
It is closed source, commercial-only code. This is one of my strong philosophies about open source: I prefer open source code to be permissively licensed, MIT, Apache 2, BSD 3-clause. I prefer my commercial code to be closed and commercial. I do not like source available licenses. They are a disservice to the community. If you have your commercial software under a source available license, the code is out there on GitHub, publicly viewable. There are two ways to do this. The worst way is you have it in the same repo as your open source code. What that means is people who want to be open source users have to try and disentangle the commercial code from the open source code.
This is famously why AWS created a forked version of Elasticsearch, because Elastic was polluting the open source repo with commercial code. They claim it isn’t a fork, but Elasticsearch Open Distro is a fork in my mind. They may commit to upstream, but it's still a fork. I prefer the commercial code to be closed because of the other side of it: suppose you have source available code and there are people in your community who for one reason or another disagree with you, disagree with the direction the project is taking, whatever. One of the great things about open source is you can fork. You can take that code and, if you want to, take ownership of it yourself and try to create a new community off of it or a project that has a life outside of the life of the original project. If you have a bunch of source available code out there that some company owns and you fork their open source code and start building some of those same features, they can easily come after you and claim that you used some of their source available code or that you looked at their source available code. Whereas if it's closed and commercial, they have to claim industrial espionage for that to hold up.
{{divider}}
I never considered that because we're struggling with that at the moment. You can't write a clean room implementation of it or it would be hard to argue one way or the other.
If you're a heavy user in that community, it's almost certain that you've looked at that source available code. There's no way you can do that clean room implementation. It's a disservice to the community to have source available. Source available is good for other reasons. It's good if what you're doing is creating a freemium product, which is essentially what that is. Remember old shareware games or whatever, where you get a free version and then you can upgrade to the paid one, but it's not open source. To me, it's not in the same vein as open source. I like open source because I like the fact that anybody can take the code and they can build a business off of it. They can do whatever they want with it. That is not what source available is.
{{divider}}
Has anyone done any serious fork of your projects and tried to run with their own clustering or high availability?
In China, yes. Alibaba has hosted versions of Influx. They have forked versions of Influx where they embedded clustering. In Europe and the States, I don't think so. There are hosting providers who host InfluxDB, but they're hosting the open source bits. I don't think that they've forked it.
{{divider}}
They're super heavily sharded.
They'd be able to run on a single server. That was the big licensing change we made in 2016. We've been on that same path since. Largely, our open source code is MIT-licensed and the commercial code is closed and commercial.
{{divider}}
You started selling those commercial proprietary features. How was that? It must have been difficult to know you would have little idea of the appetite for that product. Is it a paid product?
That's a paid product. At the time, we were super nervous about making that change because we have no idea whether or not anybody's going to like this. In doing this, we could completely destroy the community that we had built so far on the open source. The worst-case scenario was we make this announcement, everybody abandons us. Our community numbers completely take a dive and we don't get anybody buying the product. What happened was we had immediate interest from people wanting to buy the product. While the Hacker News thread was rough and people were pissed off, the community still continued to grow. We launched the commercial version of the product officially in September of 2016 and began selling it from there.
Fast forward to now, we have a business that's built off of that commercial product. We offer it on premise. You run it yourself. We have a cloud version of it in AWS where you pick a size. We'll provision a new cluster for you, install that software, adds monitoring, alerting its backups onto it and run it. We have a new product that we launched in 2019. Most of our business at this point is built off of that. We've had great success over the last few years building that out. Our Series B round of funding was at that point, largely based on we closed that in August of 2016. At that point, it was based on maybe a little bit of early traction because we started selling that product before we made it available.
That's what the demand was like at that stage. In 2017, we had a great year commercially. In January of 2018, we raised a Series C round of funding, which I believe was $30 million or $35 million. 2018 was great. In January of 2019, we raised a Series D round of funding, which was $60 million. All that fundraising was based off of the commercial traction that we got with the product. The thing is our community has grown a lot over that time. It was February or March of 2016. It was 10,000 servers that we were reporting. Now it's 400,000. The community has grown, but at the same time, we left a big hole in the market.
My thesis was nobody's paying attention to time series. It's a need, but in order for it to be fulfilled, you need something that's scalable and fully open source, and we had taken one part of that away. What we've seen in the years since is that there are a ton of competing projects now. You see a new time series database on Hacker News every other week almost, it seems like. There are a ton of competing projects, competing companies. We've limited the adoption of InfluxDB. If we had had these features out in the open the whole time, the size of the InfluxDB community would probably be at least 10x what it is. There would probably be fewer competing projects because people would have been more motivated to contribute to Influx and build on top of that. The announcement I made, which I'm not sure if you saw at all. We had our virtual conference. I gave a talk about a new project that we're building. I'm building it myself. I get to write code again.
It's a project called InfluxDB IOx, which is short for Iron Oxide. It's a new core of the database. It's written entirely in Rust. It's an in-memory columnar store with object storage as the backing store. The other thing is it's a system for defining replication rules and partitioning rules of time series data in a set of servers. You can have this weird, federated configuration. You could have a highly connected mesh cluster. All of this is stuff that the operator controls. This whole project is dual-licensed MIT and Apache 2, which is common in the Rust ecosystem. It’s the open source version of, “We don't care, use it. Don't bother us.”
With this project, I wanted to build a new core of the database because I thought that a lot of interesting things have happened since 2013 when I first created Influx. I wanted to take advantage of those things and build on top of them. Object storage is the data lake. Kubernetes is this dynamic cluster scheduling environment. The other part of the thesis is that a columnar database optimized for time series can achieve as good a performance on what InfluxDB is good at as well as best in class performance on larger scale analytic queries, which Influx doesn't do that well. The architectural design I have in my mind mirrors what the commercial design of the project is, which is the server itself is designed to be a shared-nothing server.
You can define rules for replication, data partitioning. All that stuff is out in the open and that's by design. In order for us to run this software and operate it in our cloud environment, there's another piece of software that we have to write, which is the coordinator and operator of a fleet of these servers. That piece of software is closed source that we run in our cloud and we'll make available as a commercial offering. The difference here is that we will run the open source bits of IOx as is in our cloud environment. We will be running the open source code in our cloud environment. We'll be running off of Master, whereas with our 1.x version of our product, there's open source, and our enterprise commercial offering is a fork of the open source project.
That's true open core. If you're going to buy our commercial offering, say you're running InfluxDB open source. You decide, “I want the high availability or scale out or whatever.” You buy our commercial product and replace your open source installation with it, which is a heavier lift, a bigger ask than another way to adopt it, which is the design for IOx. If you're running one IOx server or 100 IOx servers and you want to buy our commercial product, it's complementary. You bring it in. What it will do is it will make managing, operating and running those servers easier.
We’re not turning anything off. For me, the design is both technically how I wanted the project to be separated. It's a technical win, but it's also architecturally how I want the business to be constructed. I was thoughtful about the licensing, how the software is going to be commercialized and how it's organized. I wanted the individual IOx servers to be stupid. I wanted all the intelligence for operations to exist in a separate program.
What I wanted is flexibility for the operator to control the operational environment. If they had servers in a highly connected high-speed data center environment, great, or they could have a few servers here, a bunch of servers out at the edge. I wanted a number of different kinds of cluster topologies and operational environments to be able to exist. What that means is you can't have that baked into the system itself. I wanted that piece to be separate so that you could iterate on that independently of the actual core of the database itself.
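As a rough illustration of the design described above (simple servers that only apply partitioning and replication rules, with the topology decided by a separate program), here is a minimal Python sketch. Every name in it (`PartitionRule`, `Server`, `configure_edge_topology`) is hypothetical and invented for this example; none of it is taken from InfluxDB IOx, which is written in Rust and whose actual API is not shown in this conversation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class PartitionRule:
    """Hypothetical rule: bucket rows by measurement name and UTC day."""
    time_format: str = "%Y-%m-%d"

    def key(self, measurement: str, ts: datetime) -> str:
        day = ts.astimezone(timezone.utc).strftime(self.time_format)
        return f"{measurement}/{day}"


@dataclass
class Server:
    """A deliberately simple server: it only applies the rules it is given."""
    name: str
    partition_rule: PartitionRule
    replicas: list = field(default_factory=list)    # other Server objects
    partitions: dict = field(default_factory=dict)  # partition key -> rows

    def write(self, measurement, ts, value, forward=True):
        key = self.partition_rule.key(measurement, ts)
        self.partitions.setdefault(key, []).append((ts, value))
        if forward:
            # Replication rule: copy the row to each configured replica once.
            for replica in self.replicas:
                replica.write(measurement, ts, value, forward=False)


def configure_edge_topology(edges, hub):
    """Hypothetical operator-side step, kept outside the server code:
    every edge server replicates its writes to a single hub."""
    for edge in edges:
        edge.replicas = [hub]


# Usage: two edge servers feeding one hub.
rule = PartitionRule()
hub = Server("hub", rule)
edge_a = Server("edge-a", rule)
edge_b = Server("edge-b", rule)
configure_edge_topology([edge_a, edge_b], hub)

ts = datetime(2020, 11, 10, 12, 0, tzinfo=timezone.utc)
edge_a.write("cpu", ts, 0.42)
edge_b.write("cpu", ts, 0.58)
```

Because the topology lives in `configure_edge_topology` and not in `Server`, a different layout (for example a fully connected mesh) could be swapped in without touching the server code, which is the separation the answer above argues for.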
{{divider}}
How much of this was informed by the experience when you created your closed components for version one?
All of it. There's no way I could have arrived at this configuration and this project without having gone through the last several years of creating this thing both in terms of the actual technical choices, but also the business choices. Seeing how people adopt it, how people buy it, what things they're looking for. My hope is with IOx that InfluxDB will become a project that is larger than InfluxData, the company. If you're building a platform, the true sign of a good healthy platform is one where the total size of the market for the platform is larger than any single vendor. There is no vendor that's capturing all the value from that platform. For most people who are building a company, that makes them uneasy.
{{divider}}
It sounds counter-intuitive almost?
Yeah. If you’re building a VC-funded business, you want to capture as much value as you possibly can to get the best returns and whatever. The problem is with open source, it's not a zero-sum game. The market is this big and you have to capture whatever. With open source, the more popular a project gets, the bigger the market gets. You can get smaller and smaller slices of that market and be as successful. There are other companies making money off of InfluxDB, but none that are even close to what we're doing. You could add up the money that other companies are making off of InfluxDB, sum it together and it's still going to be less than what InfluxData is making on InfluxDB. What I'd like to see a few years from now is InfluxDB IOx as a project and as an open source ecosystem being way bigger than InfluxDB is now. I want to see InfluxDB IOx crack the top ten in database engines ranking.
{{divider}}
It's competing with the previous projects?
It's the new core of the database. We have version 2.0 of the database. We released the open source. We've got our cloud version of that available. The cloud version is not just a database. It’s like this entire platform. You can define rules for collecting data, monitoring and alerting rules, dashboards, all this other user stuff. This is the core of the database. What we're going to do is make it available as part of our Cloud 2 offering. People who are creating new versions of the database can opt into it. Over time, over the course of 2021 and the year after, we're going to be migrating people over to that.
For our cloud customers, they won't be disrupted at all. The way our 2.0 Cloud product is designed is we're able to replace individual services within it without having any customer downtime or without doing a migration that’s visible to them or anything like that. For our open source users and for our on-prem enterprise customers at some point in 2021, we'll have this as a backend that they can use for InfluxDB. It's an additional piece of software that they can run. It’s the storage backend.
{{divider}}
Are you concerned that you were going to split the community effort and energy? Do you see them as complementary?
I see them as complementary. When you think about the core of a database, few people contribute to the core of the database. Most of the people use it around the edges. The thing is we're bringing in API compatibility so people who are around the edges are going to be able to use it the same way.
{{divider}}
If you’ve got any information on how people can help contribute other than the normal stuff, do you wish people would file more good bug reports or write more documentation?
I view everything as contribution. Even talking about it on Twitter or writing a blog post, giving a talk, that's all contribution. Our community site is Community.InfluxData.com. There is a Discourse discussion forum. We also have a community Slack. People can find us there. All of that stuff is good.
{{divider}}
Paul, thanks for your time. That was super fascinating. It’s an interesting journey you're on.
Thanks for chatting with me.
For the last twenty years I have worked in technology filling positions as a CEO, CTO, Director of Engineering, software developer, tester, and network engineer. Since 2001 I have worked primarily on web applications. From late 2005 to 2010 my primary focus was Ruby on Rails and the Linux server stack. In the fall of 2005 I went back to school to finish up a degree in computer science at Columbia University. During my time there I pursued interests in machine learning, information retrieval, natural language processing and search.
While in school I continued my professional growth by getting involved in the New York Ruby community and speaking at professional conferences. I completed my degree at the end of 2008 and have been working feverishly ever since.
I’ve written a book (Service Oriented Design with Rails), recorded instructional videos (Working with Big Data), organized over 100 meetups (NYC Machine Learning), founded a startup as its CEO (InfluxData), gone through Y Combinator in the winter of 2013, started the open source project InfluxDB, and raised institutional money for the company behind InfluxDB from Bloomberg Beta, Mayfield, Trinity, and Battery Ventures.