Prometheus

Interview With Julius Volz: Co-Founder of Prometheus and Founder of PromLabs
By Ben Rometsch on August 13, 2024

Entrepreneurs who bootstrap a real startup, build a team, and then go on to create something truly groundbreaking deserve respect.

Julius Volz
Co-Founder of Prometheus
https://feeds.podetize.com/ep/pVUFqPmd3/media

Managing a modern application landscape can be a balancing act. Enter Prometheus, an open-source champion for monitoring and alerting. In this episode, Julius Volz, co-founder of Prometheus and founder of PromLabs, joins Ben Rometsch to discuss the origins and evolution of Prometheus. Julius traces its inception to his early days at Google and SoundCloud, where the need for a robust monitoring system inspired Prometheus. He explains how Prometheus revolutionized monitoring with its pull-based approach, powerful PromQL query language, and flexible data model, enabling effective monitoring, alerting, and automation. Discover how Prometheus continues to shape the future of monitoring!

---

Introduction

Welcome back, everybody. It’s towards the end of the EURO 2024 sporting competition here in Europe. Luckily, I have with me Julius Volz. Welcome, Julius. Thanks so much for being on.

Thanks for having me.

Technical Design And Origins Of Prometheus

For those of you who have been following that and are interested, I always like to try and give a time point for these episodes. Do you want to explain a little bit about your background and why you’re here?

Sure. I’m from Berlin. I studied computer science. I went to Google as my first job, and then from Google, I went to SoundCloud. Google was in Zurich. SoundCloud was back in Berlin, my hometown. That’s where Prometheus started. Coming from Google together with another ex-Googler, we missed something like Google’s internal monitoring system to be able to monitor dynamic infrastructures and processes, cluster scheduler-type systems. That’s a very short version. We started building Prometheus in our free time at first and then later at SoundCloud. I can go more into that.

Open-Sourcing Prometheus

I’m the co-founder of Prometheus. This was twelve years ago by now. Prometheus is old. This was 2012. Later on, after I had left SoundCloud and we had fully published Prometheus with blog posts, announcements, and so on, I first did freelancing around it. I was helping companies either use it with training and consulting or custom software development where they wanted to build platforms on top of Prometheus or integrations with it and so on.

Later around 2020-ish, I got an idea for a small product, a visual query builder around the Prometheus query language. That’s when I started PromLabs, which is my company that is just myself. It’s a one-person company. I was thinking that if I built this extra thing, I would at least have the chance of making a little bit of return on it. That is also open source. It’s also part of the open-source Prometheus project as a separate server still. It’s called PromLens. It has made some money in the interim. I got sponsored for open-sourcing it and so on.

Nowadays, I focus mainly on training. That’s both self-paced courses, where you go to Training.PromLabs.com and you can learn all the basics of Prometheus-based monitoring, especially stuff like the PromQL query language, alerting, and all those things, and live training as well. Companies come to me and say, “We have a team of 30 people. We need training for them in PromQL,” for example, and then we schedule something. That’s the background.

That’s awesome. First of all, thank you for Prometheus. It is one of a handful of open-source tools that the community takes for granted now. It’s like Redis or whatever, part of the furniture that you can’t live your life without anymore. Thank you so much for that. It’s also interesting. I wonder how many services have been built and are closed source, generally in megacorp FAANG-type companies, where someone with experience of those things has left those companies and wanted that furniture back in the room.

There’s an interesting feature flag analogy here because there are a lot of ex-Facebookers. Facebook has this crazy powerful feature flagging tool that’s an A/B testing or multivariate testing tool that is internal to Facebook. We’ve had this a lot with Flagsmith where ex-Facebookers have come and looked at Flagsmith. I love to see it, but I’m wondering. There must be a fair number of large projects that have had their inception or that birth story similar to you and Prometheus.

The thing that stands out to me from a technical point of interest with Prometheus is that it has a pull model rather than a push model. Was that conceptually the thing that was different from what was out there that made you start it in the first place? Were there other time series databases out there? What were the things that didn’t exist that made you want to start it in the first place?

The pull aspect was one factor but not even the most prominent one. Some people might not know what Prometheus is. It is a metrics-based monitoring system. The architecture looks like this. You run a Prometheus server, or multiple, in your organization and you configure that server to pull metrics from things we call monitoring targets. That could mean your own processes running somewhere. It could be a device that you want to monitor that has a little agent running next to it that can produce Prometheus metrics.

This pulling happens over HTTP in a protocol that we define. It’s text-based at the moment but highly optimized where you transfer the current state of all metrics to the Prometheus server in that pull or scrape as we call it. The Prometheus server builds up time series or metrics over multiple successive scrapes and then it gives you this powerful query language called PromQL, which you can use for all the use cases that are relevant in monitoring. That’s dashboarding, ad hoc querying, automation, but also alerting.
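For illustration, the scrape configuration Julius describes lives in the Prometheus server’s prometheus.yml. A minimal sketch might look like the following; the job name, target address, and interval are placeholders rather than anything from the conversation:

    global:
      scrape_interval: 15s            # how often to pull from every target

    scrape_configs:
      - job_name: "my-service"        # hypothetical logical name for this group of targets
        metrics_path: /metrics        # the default path; shown here for clarity
        static_configs:
          - targets: ["10.0.0.5:8080"]   # host:port where the process exposes its metrics

With that in place, the server pulls http://10.0.0.5:8080/metrics every 15 seconds and stores whatever it finds as time series.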

Alerting is all tied into this. You can create fine-grained, specific, and dimensional alerting rules. There’s a dimensional, label-based data model. Both the language and all the other components work with that very well. Later on in the routing, there’s an Alertmanager component to send you notifications. It’s all about that. It’s about live systems monitoring. It’s metrics only. It doesn’t do logs or traces. If you want to do anything other than metrics or numeric values that go up and down, including histograms and so on, then you need some other system in addition. That’s pretty common for people to have.
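As a sketch of what such a dimensional alerting rule can look like in a Prometheus rule file (the metric name, labels, and thresholds here are invented for illustration):

    groups:
      - name: example-alerts
        rules:
          - alert: HighErrorRate
            # Ratio of 5xx responses to all responses over the last 5 minutes,
            # computed separately for every job/path combination.
            expr: |
              sum by (job, path) (rate(http_requests_total{status=~"5.."}[5m]))
                /
              sum by (job, path) (rate(http_requests_total[5m])) > 0.05
            for: 10m
            labels:
              severity: page
            annotations:
              summary: "High 5xx rate for {{ $labels.job }} at {{ $labels.path }}"

The rule fires one alert per job/path combination that crosses the threshold, and the Alertmanager then routes the resulting notifications.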

The main design thing back then was, first of all, the data model. You have to think about this from a perspective back then because nowadays, it’s pretty mainstream to have a label-based key value tag style data model that allows you to split up a metric into multiple sub-dimensions. Back then, this didn’t exist outside of maybe OpenTSDB which had this label-based data model and was also inspired partially by Borgmon maybe.

Challenges

It was very complicated to run. You needed Hadoop and HBase or that kind of infrastructure. Also, it didn’t have a proper query language. It had minimal operations you could apply at the time. It wasn’t meant to be an active monitoring system that goes out for you, fetches data, checks if something is not working, sends you an alert, and those kinds of things.

Within a highly complicated clustered system, an intricate clustering algorithm may be the first thing to fail if things go bad.

The idea behind Prometheus was first, let’s get the data model right. Back then, you had hierarchical ones. In Graphite, for example, it’s like foo.bar.baz or whatever. You have a rigid hierarchy. Sometimes, these hierarchy components did carry meaning; one component might be the method of an HTTP request that you’re counting. The next component might mean the status code or something.

It’s all very rigid. You can’t add another one. You need to know which path component is what. This dimensional data model is flexible for partitioning metrics data into different sub-dimensions. Which process did this metric come from? Inside the process, which HTTP method is this counting or which path? It’s anything you want to split up a metric by.
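For example, with made-up names, the same counter in a hierarchical scheme versus the Prometheus data model might look like this:

    # Graphite-style hierarchical name: position implies meaning
    webapp.api.GET.200.requests

    # Prometheus-style: explicit, extensible label names
    http_requests_total{service="webapp", handler="/api", method="GET", status="200"}

Adding another dimension later, say the instance that served the request, is just another label rather than a new path component.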

There’s then the query language. We wanted to be able to do math in complex ways between sets of time series. That’s what PromQL allows you to do. For example, you can have a whole list of disk usages, each with a label that indicates their host and the file system mount point, and then you can have a whole other list of file system sizes, also with a mount point and the host. You can join them up on identical label sets and then get ratios for all of these.

You can be very flexible about that. You can specify subsets of labels. You can do many-to-one matching or one-to-many. There are more math and arithmetic operations possible between whole sets of time series that are automatically joined to each other in PromQL than in previous query languages. It’s very optimized towards that time series computation use case.
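A hedged PromQL sketch of the disk example, using node_exporter-style metric names (my assumption; specific metrics aren’t named in the conversation):

    # Fraction of filesystem space still available, per host and mount point.
    # The division joins series whose label sets match up.
    node_filesystem_avail_bytes / node_filesystem_size_bytes

    # The same join restricted to a subset of labels:
    node_filesystem_avail_bytes / on (instance, mountpoint) node_filesystem_size_bytes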

That’s interesting. The data model is very unstructured but then leans on the query language to allow you to bring that kind of structure together back before it’s shown to a human.

I would even say that, in a way, the data model is more structured than before. If you look at Graphite or StatsD before, you had these path components but you had to know this component means the status code, the handler, or something. Now, there’s an actual label name that you can read that tells you what this means.

It’s a little bit more explicit even, but it is loose in the sense that there are no semantic conventions, which is what OpenTelemetry later added, for saying that anytime you track an HTTP method, for example, it has to have this http.server.request.method tag name or label name in Prometheus. You could add conventions for how to name things on top of that. I would first say these two, the data model plus the query language.

This was also interesting at the time. When we left, Google had already had the Borg cluster scheduler for a decade, something like Kubernetes but internal. They had this monitoring system that worked very well with that and that could monitor dynamically all the processes running on different hosts, ports, and so on.

We came outside to SoundCloud and they had built something very primitive but similar in the sense that they had built a container-based cluster manager based on The Go Programming Language version zero.something and raw LXC containers. This was before Docker even existed. This was 2012. Docker didn’t exist. Kubernetes didn’t exist.

I was trying to do history. I looked up SoundCloud before we chatted and was trying to think Docker didn’t exist in 2012, right?

No. I got a demo of it before it launched in 2012 or ‘13. 

The initial release was in March 2013.

What we found was they have a cluster scheduler. It was not as sophisticated as one at Google, but the main property was that it already scheduled all these different microservices that made up SoundCloud dynamically on any host of the cluster on different ports every day. Every time a developer said, “Let’s roll out a new revision,” they would scale down the 100 old microservices processes and scale up 102 new ones with a new revision, and so on.

That was all monitored by Graphite, Munin, Ganglia, and Nagios. There were ten different monitoring systems. It was impossible to figure out when there was, for example, a latency spike. Was it in one instance? Was it in all of them at the same time? We didn’t have the power to track enough dimensionality in this one Graphite instance. Also, if we did it, we didn’t have a real query language. Alerting was in a separate silo that was in Nagios. You could sometimes query a little bit of data in Graphite to base your alerts on but that wasn’t the primary means of alerting if something has gone wrong.

The idea of Prometheus is more, “Let’s expose everything as metrics over this pullable format. Now, let’s make the monitoring system know all the things that should exist.” You can configure it directly statically, like, “Here’s a list of targets,” but that breaks down immediately once you have a cluster scheduler like Kubernetes. You have to discover them. You have to dynamically let the monitoring system discover what should exist. This is called service discovery. This is fundamental and should be part of any monitoring system. You want to know, “Are the things that I expect to exist actually there?” You need some source of truth to compare the actual state with what you want to be there.

Back then, it worked similarly on the internal cluster scheduler. Nowadays, Prometheus would talk to the Kubernetes API server and say, “Give me all the endpoints of a service, the pods, the ingresses, and all these different cluster-level objects that exist in Kubernetes.” You get that whole list in Prometheus and you can map it into a target. You can say, “Let’s regularly go out every 15 seconds or so to these 15 endpoints of the service or the pods,” depending on what you want to do, and pull metrics from them.
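A rough sketch of what that Kubernetes service discovery looks like in a scrape configuration; the job name and relabeling choices are illustrative rather than from the conversation:

    scrape_configs:
      - job_name: "kubernetes-pods"
        kubernetes_sd_configs:
          - role: pod                # other roles include endpoints, service, ingress, node
        relabel_configs:
          # Only keep pods that opt in via the prometheus.io/scrape annotation.
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: "true"
          # Attach namespace and pod name as labels on every scraped series.
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod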

There are three things that this service discovery gives you. First of all, it tells you what should be there. Second, it tells you what that thing is that should be there because you get a lot of juicy labels about it, like, “This is pod so-and-so. This is namespace so-and-so.” You then can attach that to all the time series that you pull from that object, so it gives you great metadata support.

Thirdly, it tells you how to get to that thing technically and pull metrics from it. There’s some IP or host plus port and protocol that tells you, “This is where you can fetch the metrics from this process.” If that pull succeeds, you can ingest the metrics, but if it fails, you can also record that it failed. You can have this synthetic up metric which tells you with a value of 0 or 1 if a target is reachable or not. You can have an immediate simple health alert to tell you, “Something is supposed to be there but it’s not working. It cannot be scraped, so it’s down.”
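That synthetic up metric makes the most basic health alert trivial. A sketch of such a rule, with names and durations chosen here for illustration:

    groups:
      - name: target-health
        rules:
          - alert: TargetDown
            expr: up == 0        # the last scrape of this target failed
            for: 5m              # only fire if it stays unreachable for 5 minutes
            labels:
              severity: critical
            annotations:
              summary: "Target {{ $labels.instance }} of job {{ $labels.job }} is unreachable"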

This is harder at least to do in a push-based system because you now have to configure both ends. Usually, if you have a service process, all it needs to do is spin up an HTTP server serving its current state of metrics that is tracked by one of the client libraries of Prometheus. It doesn’t need to know who it is. It could even have a different identity from the perspective of multiple differently configured Prometheus servers that are run by different teams who want similar metrics but with different identities. You only have all that configuration in the monitoring system whereas in a push-based system, most of the time, if you use OpenTelemetry to push metrics into Prometheus, how do you know that something isn’t reporting and has never reported even?

You’re now trying to prove a negative, right?

Yeah. Honestly, many people ignore it and are like, “As long as my overall service is still working, I don’t even know that,” or they join in some other set of metrics. For example, there’s a component called kube-state-metrics in Kubernetes which tells you all the pods and so on that should be running. This is more work. You have to join these different data sets together and check if one is missing. You also need to find common identities like labels to match them. That requires more work and especially these identities on both sides to match up. You need to make sure that it all works. Pull is nicer and even simpler in that way.
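One way such a join can be expressed in PromQL, as a very rough sketch: kube_pod_info comes from kube-state-metrics, while my_pushed_heartbeat is a hypothetical pushed metric, and in practice the label names on both sides have to line up:

    # Pods that should exist according to kube-state-metrics
    # but have no matching pushed metric reporting in.
    kube_pod_info unless on (namespace, pod) my_pushed_heartbeat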

Plus, the nice thing that I enjoy with a pull-based model is that you can manually go to any monitoring target in a browser and check the current state of its metrics. You could go to /metrics and it shows you, “I’ve counted this many HTTP requests so far. My current number of threads is five.” You can see it. It’s a text-based format. Whereas if you want to do that with a push-based one, it’s a little bit harder. You need to first push it somewhere and then you can look at it. That’s one thing.
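What you see at such a /metrics endpoint is the plain-text exposition format, roughly like this (the metric names and values are invented for illustration):

    # HELP http_requests_total Total number of HTTP requests handled.
    # TYPE http_requests_total counter
    http_requests_total{method="GET",path="/api",status="200"} 1027
    http_requests_total{method="POST",path="/api",status="500"} 3
    # HELP app_threads Current number of threads.
    # TYPE app_threads gauge
    app_threads 5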

Maybe the last thing that made Prometheus successful was that we kept it simple. We didn’t want the core architecture to be a super complicated clustered system where there’s some crazy clustering algorithm that might be the first thing to fail in the bad case where you need your monitoring system. Fundamentally, Prometheus is built up of simple servers, individual ones that scrape and compute alerting rules, and then through the Alertmanager, send you an alert if something is wrong.

If you want stuff like high availability, you could, for example, run two identically configured ones that scrape the same data from the same targets. It’s all idempotent. You’re just pulling the current value. If you want big scalability beyond a single server, because eventually you will exceed the capacity of a single server, you either run many Prometheus servers (the original idea of Prometheus is that every team has one or multiple), or there are also scalable layers on top of that.

That is either Thanos for integrating the view over many Prometheus servers or, for example, Mimir or Cortex for having one big centralized service where you send all your metrics to, and that then stores them in a cluster in a highly scalable and redundant way. These are all trade-offs that you can then think about later. I always recommend if you have a decent-sized use case or a small one, start with a normal Prometheus server and you can still think about scaling strategies at some point.

That’s interesting and super detailed. There’s a ton of stuff I’ve learned. As an end user, the beauty of it for me was this magic of not having to spend three days doing all of the push stuff. You point it at your service and all of a sudden, all of the data starts to light up and come in. That to me is this beautiful thing. It also means that adding Prometheus support with a pull model to services feels so much simpler. Almost any API is going to be speaking REST anyway, so you can serve that metrics page.

I don’t want to malign push too much either because it has its advantages. Pull works well if you’re in a high-trust environment in the data center and so on. Once you have too much network segmentation or a lot of security rules and everything, then it’s a little bit harder to establish a TCP connection to many different endpoints that might be even somewhere on the edge in a customer’s home or so. That becomes harder.

That’s the main difficulty there, and an aspect where push helps as well. There’s one more aspect: if you put your monitoring system in charge of pulling, it can also be in charge of rate limiting itself and not overloading itself. You have to take at least more precautions when you push to not DDoS your centralized monitoring service. There are always ways around that.

A monitoring system pulling data can also rate-limit itself to avoid overload.

Even Google now uses a very different monitoring system that’s more push-based, Monarch, and it struggled for a long time with exactly this, where some random intern’s MapReduce could take down the monitoring system by pushing too much stuff. There are different pros and cons. It’s a bit of a religious question. Some people prefer the push-based approach. They’re like, “Let’s push everything to the single known endpoint to have a centralized collection area there.”

Prometheus’ initial release was twelve years ago. I feel like now, it would probably feel quite natural for a company like SoundCloud to open source a project like that, but it feels like back then, that was quite a progressive thing to do straight off the bat. Can you talk a little bit about whose idea was it to open-source it or how did it get onto GitHub in the first place?

Honestly, we took a lot of liberties back then. We came from Google and they were like, “SoundCloud is always unstable. Can you help us make things better?” Matt Proud and I didn’t know each other before, but we met at SoundCloud and had similar thoughts in that regard. We tried to make things better and both arrived at the same conclusion all the time. Before we can make things better, we need better visibility. We drew this dependency graph. We’re like, “First, we need to figure out what is going on.”

Before we can make things better, we need better visibility to figure out what is going on.

We then started building Prometheus in our free time at first. We were like, “Let’s build it in our free time,” which we did for a couple of weeks or months. There is always a little bit of a gray area. It’s still free time but you’re working. It blended over in the end, especially as we saw more of a chance that this could work and that this would be good to try out at SoundCloud itself.

In a couple of months, we got a prototype working where it ingested some data from a target, stored it, and made it queryable, and you could graph it. Prototypes are easy; making it work at scale for normal people was 99% of the remaining work. At that point, we felt emboldened to ask the first internal alpha customers, “Do you want to try out this new thing we’re working on?”

In terms of open sourcing, since we officially started it in our free time, we put it on GitHub from day zero. We put it on there and didn’t even ask anyone. We made it open source. I forgot if it was already Apache 2.0 or some other initial license. We then started working on “our project” on company time later on. That was the hack around that there.

SoundCloud was the first production user kind of thing.

It was a very technical and political uphill battle for the first one and a half years because it didn’t work very well yet. It always crashed and stuff like that. Everything Prometheus does is different from all the monitoring systems before, how it collects data, the data model, the query language, everything. People were like, “That’s strange. Why do we need this? Can’t you take some of the existing systems and change them?”

We looked at that for a while, Graphite and so on, but we always arrived at the same conclusions. We would need to replace everything. We need a different data model, a different layer, a different query language, and more efficiency in the storage. Dynamic, short-lived time series should be a well-supported thing, and Graphite back then created a file per time series, for example.

Eventually, it got to a point where everything worked well enough there. The Prometheus server was working well enough. I built a dashboard builder before Grafana existed, or when it was just coming into life back then. We had PromDash for a while. Grafana is fine nowadays. They’re doing a great job at that. We had instrumented the internal cluster scheduler at SoundCloud with resource metrics for every container running on the platform.

Now, people saw the usefulness of the lights going on in that cluster. They’re like, “I can see the CPU usage, the memory usage, and all these different processes with labels on them by revision, where they’re running, and so on.” Suddenly, the burden of learning a new thing was overshadowed by the value they were now getting.

Eventually, there was a flip. Every new service has to have Prometheus metrics. People are like, “Let’s standardize on Prometheus.” It took a couple of years. Also, many more people then joined later on to help improve things, stabilize things, and evangelize it even within SoundCloud like Björn Rabenstein, Ben, and others.

It was a bit of an uphill battle to get this even adopted and built in SoundCloud. We took a lot of liberties. In 2015, we decided that it was at a place where we wanted to announce it to the wider world. Technically, it had been open source. We had 1 or 2 external users already that we had known from other places, but we hadn’t told the world yet.

We wrote a proper blog post in the SoundCloud engineering blog about it and explained why we thought this was a good idea. It hit the nerve of the time because Kubernetes was just starting out in 2015 and it was picking up steam. People were starting to use Kubernetes. They needed a monitoring system that could work with such a dynamic environment where you needed to discover all the different things that are changing around every day. You can’t statically configure all those things anymore in your monitoring system. Prometheus was the perfect match.

It makes sense because Kubernetes from Google also was inspired by Google’s Borg cluster system. Prometheus was inspired by Borgmon, the monitoring system inside of Google to monitor Borg and the services running on it. They’re both analogies to their counterparts in Google, so they philosophically and architecturally work well together. Soon, both had cross-support in terms of service discovery and Kubernetes exposing Prometheus’ metrics.

On the open source front, there was one more little hurdle we had to overcome when we wrote this real announcement blog post from the SoundCloud side because we wanted to make sure that SoundCloud was okay with all of this. There was a short period where the open source program had said, “You have to put this under Github.com/Soundcloud.” We have many repos. They all should be collectively together under the Prometheus project. It would be unwieldy and also not good PR-wise to put everything under the SoundCloud GitHub org. We managed to convince them. In the end, we were allowed to pretty freely release it on GitHub.

A year later roughly when the CNCF was forming, we joined the CNCF, the Cloud Native Computing Foundation, as the second project after Kubernetes. We were accompanying the creation of this foundation early on there. They were still pretty open about how the foundation would work. We wanted to clarify that Prometheus is an independently governed project and not owned by any one company like SoundCloud or anyone else.

We were already thinking, “Do we have to establish our own foundation now? Do we go Apache or something?” The CNCF was forming and we were thinking, “Thematically, it falls well in there. Let’s do it.” They handle all the foundation stuff. They officially own the assets, whether it’s the domain names, trademarks, and so on, but they don’t interfere much with how we run the project at all.

We need to have proper governance and so on, which we then also established, but we came up with the rules of governance and how exactly we make both team decisions, like political-style decisions and technical decisions. SoundCloud allowed that as well. At that point, everything was like, “Now it’s clear. It’s an independent project.”

Future Outlook

We have 30 voting team members from many different companies, whether that’s, for example, Grafana Labs or PromLabs. That’s me. Red Hat, we still have. At one point, CoreOS was acquired by Red Hat, which was acquired by IBM, but Red Hat still exists as a company. Many different companies have an interest or a stake in the Prometheus ecosystem, and sometimes also individuals who like building stuff. Maybe that opens the next topic, which is how people make a living working on Prometheus.

That’s one of the things that I was going to ask next. For the engineers who were being paid by SoundCloud but were working on it, did there come a point where they felt like saying, “It works now. It’s got this life of its own now. The boat has launched. Can you come back and work on SoundCloud stuff?” I seem to remember within a period of a couple of years, it was this before and after Prometheus of like, “How are we going to do all this monitoring?” Back then, I was doing consulting for large enterprises. I remember within a very short window of time, everyone was using Grafana or Prometheus. How did you navigate that path?

That was crazy. 2015 and ‘16 were crazy because it went from almost 0 to everyone using it or at least starting to adopt it. For me, exactly what you said happened. SoundCloud’s main business was not building monitoring systems. I understand that. At some point, they also said, “Can you work on this and that other thing or some AWS migration of some service?”

I felt a bit burned out that year because I felt like there were almost two full-time jobs. One is always community management because you start getting a lot of pull requests and people interested. Initially, you have so much fire, you want to help everyone use it, and you still build it. That was a large job, the OSS Prometheus part, and then the normal SoundCloud job as well.

There were some other factors as well; the office and other things were not so nice anymore, so I left and took a break for a couple of months. I still contributed to Prometheus a bit. People started writing to me from different companies and saying, “We’re using Prometheus. Can you help us use it?” or “We want to build a scalable version of this as a hosted service.” This was Cortex that we worked on back then. I helped a little bit to build that as well.

That was my story. It's a different story for different people. It’s always a bit of a fight to tell your employer if you’re doing this as part of an employment or full-time job kind of situation that it’s valuable for the employer to have their employees as regular contributors to an open source project, especially one that’s strategic for their own monitoring strategy.

In the case of Grafana Labs, it’s even clearer. They sell hosted Prometheus-based monitoring. The more people they have on the Prometheus team, the more influence inside the project they have, but also more clout towards customers, saying, “We are almost half the Prometheus team here. We built a lot of this, so we know it well.” There are companies like that who allow their employees to spend a lot of time on Prometheus. Grafana Labs is one of the premier ones, but there are also others.

It’s still always a little bit of an uphill battle because there are always these urgent things inside a company. Some fires are burning. You also need to go on call and fight some outages. Open-source work rarely seems as urgent. What’s difficult is to get people to do the general maintenance grunt work that is not so “sexy” or glamorous.

Companies have more of an incentive to say, “Let’s add this one feature that helps integrate with our product,” and then they never come back or something like this. That can be great sometimes, but it’s harder in Prometheus because there’s no one central company owning it to care about this overall health aspect of the project. Does someone triage the bugs? A lot of stuff accumulates over time.

That’s partly still done. Partly, it’s neglected, to be honest. That’s part of most projects anyway. It’s always more or less in a state of neglect. Partially, people do it in their free time or they do get paid but they don’t maybe do it as much as the other stuff that the employer asks of them. It’s a challenge. We’re still doing okay. There are also many pull requests or issues that don’t get enough attention and then they just lay there.

I’ve thought a lot about this. It’s interesting. As open source projects get larger and more mature, that ratio of glamorous to unglamorous work shifts to the point where, if you were triaging every bug and reviewing every pull request, you’d probably have a 99:1 ratio of stuff that you don’t want to do but feel like you have to do versus stuff that you do want to do.

It’s interesting from my experience anyway that the generally large projects have this thing that there’s neglect but it’s designed into the process in the way that if there are interesting pull requests, important issues, or whatever, at some point, that pull request will get over a certain level of interest. Someone will review that code and it will get merged in or not. It’s the same if an issue is bad enough.

We are dealing with this problem at the moment a bit. It’s not at the same scale, but we are coming up to our 300th open issue. We’re having this debate internally of like, “Do we do what some projects do where they have a bot that collects up things that are over a certain age or stale and closes them or do we not?” Maybe there’s no best one-size-fits-all.

It’s interesting as well that once you left SoundCloud, there were a lot of these large projects that have massive companies like Facebook, Google, Dynatrace, or whatever who are like, “We are going to put five engineers full-time on this project and they’re going to get paid to do a lot of that unglamorous work.” Prometheus, at its scale, doesn’t have that, which makes it a commendable situation. How did you deal with those problems?

Prometheus is a bit special in the CNCF if you compare it to other projects, which were strategically created by either small or big companies as a thing like, “Let’s create this strategically, like Kubernetes, so we can make Google Cloud more competitive with AWS by introducing a standard layer for moving workloads around.”

Prometheus was created initially by two guys who wanted to scratch their own itch and make their own work easier. They never had this one company that was the one company responsible for it and for its health. That’s a constant battle for us as well. It’s exactly as you said. Pull requests and issues are coming in. If they catch the eye of a maintainer, either it happens to be, “Good morning,” and you had your coffee or whatever or it looks interesting and they’re like, “We should have that feature.” That helps. It’s sometimes a bit unfair depending on who has energy, who is on vacation, or who’s interested in what. Something might be picked up or may not.

We do have a regular bug scrub where people go through old issues and check, “What’s the state here? Is someone waiting for a reply from someone?” The people in the bug scrub usually ping on the issue, like, “Person so-and-so, are you still interested in this, or do you want to close it?” That has limited success because the people who got pinged also need to respond, but it still helps a little bit.

The main problem is that there’s always an infinite amount of work and a limited number of people, and limited motivation also. A lot of it is about motivation if you’re doing it in your free time. For example, myself. I’m a one-person Prometheus-ish company. I earn my money with services around Prometheus. Theoretically, I wouldn’t have to do any open-source Prometheus work anymore, but I still do. I do team discussions, build features here and there, and review pull requests. Now, I’m building a new UI. It keeps me connected with the project and it’s fun.

I pick the areas that are most either interesting pull requests for me or when someone has bugged me enough, or I implement features that I think would be cool. If you have one company responsible for a project, that’s a little bit easier because then as the leader of that company, you can pay someone and say, “You work on this maintenance, unglamorous, and boring stuff now, and you get paid for it.” We have some people roughly doing that so that already helps.

With one company responsible for a project, assigning employees to essential yet unglamorous maintenance tasks becomes much simpler.

You must have been beating off VCs who wanted to throw insane amounts of money at you for years. What was the reason for that? Was working on the code and focusing on the products more important to you?

There was interest both early on and then also some years in. There was a peak of that in maybe 2019 where multiple very high-profile VC funds competed. You have those dinners. You talk a lot. They sound very optimistic and are like, “You have to create a company. We’ll put this many million dollars into it. It will be great and everything,” but then you have to ask yourself, “What does that mean for my life?”

It sounds flattering, exciting, and all that. I respect entrepreneurs who start a company, like a real startup, and then hire a bunch of people and are like, “Let’s go big and create the big new awesome thing.” I would formally have been in a great position for that as the co-founder of the project, with a high level of trust and all that, but I also noticed I’m not the person for that.

Entrepreneurs who bootstrap a real startup, build a team, and then go on to create something truly groundbreaking deserve respect.

I want to be able to sleep. The things that I would have to do then would not be the things I enjoy anymore like mostly hiring people, dealing with people issues, and so on. Also, I had already been doing Prometheus for a long time by that point. I was like, “I don’t know if I want to tie myself so strictly for 5 to 10 years to this.” You can always theoretically leave your own startup, but that’s never a good look.

In practice, looking left and right at startup founders and how their lives are, I decided, “I don’t think that is for me.” I respect the people who do it. It’s also a very high-risk thing. Most startups don’t work out or they require exceptional sacrifice. VCs always have some motivation at least to make it sound rosy and that everything is going to be great, but in reality, I would not be getting a lot of sleep.

I also saw that other companies at the time were already doing things with Prometheus that went in a similar direction, whether it’s Red Hat, Grafana Labs, or others who offer Prometheus services, whether it’s a Cloud-based hosted service or other services around it. It wasn’t a clear empty space, and ideally, you don’t want to have to compete with very ambitious and competently led companies doing stuff.

Success In The Tech Industry

That’s a great position to take. There are vanishingly small numbers of Hacker News stories that are number one with that story. Do you think that perspective may be neglected within our industry and community? There’s this vision of opening the NASDAQ with your IPO and all of that, like the Hollywood film story of the founder, and anything else is not seen as the fairytale that success is defined by. I don’t mean to be dismissive, but it’s this quite American or maybe Californian Valley thing of what success is, whereas that version of success happens to a vanishingly small number of people as well.

The number of people who do that, work on it for five years, and then have their company bought for less than was invested in it is way more common. Maybe it’s a European thing, but I feel like defining what success means for yourself happens less often than it maybe should within our industry, partly because of the potentially huge sums of value that can be generated.

It’s a very personal introspective thing. I also may have decided differently. Since I took a bit of time trying to think, “I know myself. How would my life be if I did this? Would I be happy? Would I get sleep and all that?” I already knew, “I have a bit of a lighter sleep and all that. That doesn’t fit me.” It’s a valid success model for people who enjoy that, so I still respect it. You can create more useful value for the world if you do that, but it wasn’t for me.

I do see more people adopting the single-person solopreneur lifestyle. You see it on social media a lot. A lot of people are posting about their little one-person projects, or one person plus a couple of temporary helpers or so. They’re posting what they’re working on, how they’re struggling, and how they’re succeeding. It does exist, but it’s maybe not as mainstream visible as the normal path.

With more people adopting the single-person solopreneur lifestyle, their online posts of struggles and successes prove that this path is viable.

That’s the thing. You’re hitting the nail on the head. It’s about the visibility of that. If it’s the Hollywood story, then that’s what journalists or whatever are going to write about. It feels almost courageous for you to say that. You were like, “I thought about what would work for me.” I feel like people don’t ask themselves that question often enough.

I had a similar decision where I didn’t want to take VC money because I didn’t want that to be my life. I have a family. I didn’t want to be away from them. I remember in a previous business, we almost got into Y Combinator. We had a second interview. I had two young children at the time. I remember when I received the email at 3:00 in the morning to say that we hadn’t got in and I was relieved. It was a very strange sensation. I was sad and let down that we hadn’t got into Y Combinator but at the same time, there was this sense of enormous relief.

It was this interesting mind game to go through because that made me realize what my preference was. It was quite lucky for me that I had a way of going through this feeling in this decision-making process with a zero-consequence outcome because my life carried on as it was going on. I was like, “I don’t want to do that.”

That’s like when you toss the coin and then you see how you feel about the outcome of the coin. My life is pretty nice in terms of flexibility. I can mostly work whenever I want. That also means that I can prioritize anything else first. If someone wants to go out and take a walk or have a social life, I can usually say, “Let’s interrupt work for that right now.” I can get back to work whenever I want. Sometimes, there are fixed scheduled things but not a lot in my calendar. I have a few scheduled things per week at most in my calendar, maybe three or something.

That’s the other thing as well. I look at my calendar for the week on a Sunday or a Monday morning. The more blank space in there, the happier I am.

Happiness is an empty work calendar. It’s not that I don’t enjoy working on stuff, but then it’s nice that I can see how my energy levels are today. If I’m super tired, I’m not going to do anything. If I feel great, I’d be like, “What am I going to work on today?” I got up and said, “I’m going to update these three training courses on my website to make sure they’re up-to-date and fresh.” I try to do that regularly. Maybe I’ll end up doing part of that and not all of it because of some work or so. During the day, I might also review a couple of pull requests or look at some emails. That’s always still happening. It’s very flexible most of the time and I like that.

That’s great to hear. In terms of the future for PromLabs and Prometheus, what’s on the horizon?

Especially in Prometheus, we have a couple of new exciting things happening now and in the future. This is going to be mostly interesting for people already using Prometheus. We have this new native histogram metric type that has been added experimentally to Prometheus, which allows you to track way more detail for way less cost in your histograms. Histograms are no longer stored as one time series per bucket but as a whole histogram in a single time series sample, with a different bucketing scheme. It’s a way more efficient, sparse, awesome histogram. That’s going to be more awesome as it gets adopted and eventually gets marked as not experimental anymore.
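The difference shows up in PromQL as well. With classic histograms you aggregate the per-bucket series; with native histograms you pass the single histogram series directly. The metric name below is a hypothetical example:

    # Classic histogram: one series per bucket, identified by the "le" label.
    histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

    # Native histogram: the whole histogram lives in one series, no _bucket/le needed.
    histogram_quantile(0.95, sum(rate(http_request_duration_seconds[5m])))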

We will launch Prometheus 3.0 at some point, with a bit of a goal of also reinvigorating the project. You need to keep things interesting. On one hand, it’s signaling, but on the other hand, it’s also a little bit of a chance for us to drop a few little things that are technically breaking between 2.x and 3.x but that almost none of our users rely on. Most people should be able to update without any issues. It frees us of some old ballast that we had to carry around and that we weren’t allowed to remove, so that helps.

I’m working on a new and a little bit less obtuse UI for that. The current UI has grown very organically over the years, with various people adding filters here and there. It looks cluttered on some pages, like the targets page. I’m working on a new one that is not too revolutionarily different but looks cleaner. It allows us to add some more cool stuff in the future as well. There’s more OpenTelemetry integration going on. More people, instead of using Prometheus’ own instrumentation client libraries, sometimes use OpenTelemetry. There are plenty of downsides to that, so I wouldn’t recommend it.

If you’re using Prometheus and you mainly care about metrics and storing them in Prometheus, then there are many upsides to using one of the native libraries in terms of compatibility, efficiency, and so on. I do get it. Many people do care about having one instrumentation client library that does traces, logs, and metrics all in one. That’s a side comment.

We do see people using OpenTelemetry more to get data out of processes but still sending the data into Prometheus. There are a number of incompatibilities, in terms of the character set that is supported in metric names and attributes, in terms of what I mentioned with this pull versus push, how you associate metadata, and so on. We’re working on different ways of improving that, to simulate that up metric that you usually have in a pull-based context, or to widen the supported character set on the Prometheus side so we don’t have to replace dots with underscores, for example.
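As a small illustration of the character-set mismatch (names simplified): an OpenTelemetry metric and attribute such as

    http.server.request.duration{http.request.method="GET"}

has traditionally had its dots rewritten to underscores to fit Prometheus’ metric and label name rules, ending up as something like

    http_server_request_duration{http_request_method="GET"}

Widening the supported character set removes the need for that kind of renaming.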

There are a few more issues like that; fixing them will at least make Prometheus a better storage backend for putting your OpenTelemetry metrics into. That will be exciting for a bunch of people. Those are the main things. There’s always a lot going on. Many Cloud vendors are offering new Prometheus services and so on. In the project, those are the main things. On the PromLabs side, I’m doing my live training and my courses. There are no huge new plans. If you want to learn Prometheus, check them out. Everything is at Promlabs.com or Training.PromLabs.com for the self-paced courses.

I feel like there needs to be an addition to SemVer where you can somehow say in the version number that even though it’s 3.0, there’s almost no likelihood of things breaking. Home Assistant is pretty good because it’s got a database versioning thing, but quite often, you’re like, “Is this going to destroy everything, or are there three people in the world who are going to get broken by this?” Quite often in the release notes, they don’t say it. That’s the first sentence that you’re interested in knowing about. Is this going to take everything down or not?

The release notes should make it very clear, “This is breaking only in these three circumstances,” and then you can hopefully take a glance and see, “I’m not doing that weird stuff.”

Closing

Julius, thanks so much for your time. I’ve been fascinated. It has been a great story. Especially for a tool that I feel so close to, it’s great to hear the origin story. I want to thank you again for the project. I wish you luck with 3.0.

Thank you so much, and thanks for having me on.
