Interview with Maxim Fateev & Ryland Goldstein: CEO and Head of Product, Temporal
It doesn't matter if you are the 1st or 10th person. If you make a significant (open source) contribution, that's going to be recognized.
Temporal is having their first Developer Experience Conference in Seattle on the 25th and 26th of August 2022. Check it out. I hope you enjoy.
Ben: I’m pleased to introduce Ryland Goldstein and Maxim Fateev. They are from Temporal. They both have interesting backgrounds, and I’m super interested in talking to them. Do you want to introduce yourselves and give a little bit of background about the business and the product behind it?
Temporal: I’m Maxim. I am Cofounder and CEO of Temporal. We can probably talk about the history of the formation of Temporal. It could be a separate conversation because it’s a long story. I’m originally from Russia, but I left Russia in 1995 and lived in Brazil for four years. My Computer Science degree is from Brazil, and I moved to the US.
I worked for a bunch of big companies. The longest stretch was at Amazon. I spent eight and a half years there and witnessed the formation of its large-scale service-oriented architecture from the beginning, and after that the formation of AWS. I also worked at Uber, Google, and Microsoft before starting Temporal.
I’m Ryland. I’m Head of Product here at Temporal. My background is similar in the sense that I started out very much on the engineering track. The first language I learned was assembly language. I started out with low-level distributed systems. Parallel programming is what I spent most of my time doing as a developer.
I joined a series of startups. I ended up starting one with a few other guys, and we ended up getting funded by Lightspeed, Dell, and some other investors. We built a cloud offering, a multitenant service, that ended up failing, but it was a good set of lessons learned. That’s what I brought in when I came to Temporal: an understanding of mostly what not to do with the cloud business model and how to approach selling a cloud product to developers. Since I have been at Temporal, I have been mostly leading product efforts and building out initial functions and capabilities in the company.
How did you guys meet?
It’s an interesting story. The last company that I was talking about, the one I started with a few other guys, I had left right before COVID, in February 2020. My plan had been to figure out what I wanted to do with the rest of my life. I wasn’t looking for a job, and then all this stuff about the pandemic started happening. I started wondering if that was the right choice. Coincidentally, one of the early investors in Temporal saw some of my writing online because I write for Stack Overflow and a few other publications.
He saw that writing. He cold-called me and said, “You should be doing product work for this company Temporal that I invested in.” That’s when he introduced me to Maxim and Samar. At first, he described it as, “Do you want to come and maybe help with this enterprise workflow solution?” I’m a hardcore C developer who had only ever done that. I’m like, “That sounds like the most awful thing I could ever imagine.” He was like, “It’s about distributed systems. You are going to want to hear about this.” That’s when I met Maxim and Samar and immediately understood that this is something fundamental. It’s not some enterprise workflow. It’s way more than that.
Samar is a Cofounder of the company. He’s the CTO, and we started the company together. We have worked in this space for many years. We met at Amazon.
You are the first person I have met who’s professionally written assembler.
Never professionally. It’s almost ironic, or maybe it’s embarrassing. The reason I got into programming is that I wanted to write bots for World of Warcraft. The way that you do that, in general, is by writing x86 and injecting DLLs into WoW’s memory.
Maxim, coming back to you, I’m curious about working on open source projects in big tech companies. Has their attitude to those things changed over the time you’ve been working? I’m thinking that it’s common now for folks like Uber or Google to open source huge projects. Potentially, it might be part of their secret sauce. Was it always like that? Have you seen that change over time since you were originally working at Amazon?
It depends on the company. Amazon was never very open source friendly. I don’t know of many large projects that came out of Amazon, at least directly related to AWS, but I might be missing something. There are some, like Ion and a few others, but nothing the size of Kubernetes, for example.
Would Firecracker be one?
That one is evolving into this significant project category. My personal opinion is that these days, as a company, you shouldn’t be building infrastructure. If you have built infrastructure as a company, because you are something like Amazon and there is nothing out there that fits your needs, you want to make it open source.
Amazon might be the one exception to that because they have Amazon Web Services. They can build this large-scale infrastructure and sell it, so it still makes economic sense. But imagine you are a normal enterprise, not a cloud provider, and you decide to build your own whatever database because you know better. You can do it better than anything out there.
You might do that. You can create the first version of whatever, and it may be very interesting. You can use it within the company. The question is what will happen to that thing in ten years. The probability of that project being successful within the company in the long term is almost zero unless it is an open source project. I have seen it happen at multiple companies: you build something, you make it available to the company, but then you cannot keep it growing.
You cannot put 50,000 people to maintain that thing. You have a project, and usually people move on. There is rotation, and the group of people moves on to the next thing. The project usually stagnates. I have seen it happen at Amazon. At the beginning of the 2000s, Amazon was ahead of most enterprises out there in scale and in the ability to build large-scale systems.
They built a lot of cool technologies, but none of them were open source. Years later, something similar appeared and became open source, and then a couple of years later, Microsoft probably migrated to a bunch of those technologies. MapReduce is another good example, from Google. Google built probably the best one.
I have been at Google, in that group. It’s the best MapReduce implementation you can get. It’s very complex and scalable, but they never released it. They wrote a paper, and others duplicated it. In the end, everybody had to duplicate whatever Google was doing. Google can invest in the internal one because it’s huge, but most companies cannot. My point is that a company should open source an infrastructure project for the long-term benefit of the company, not because they think it’s cool.
It makes sense intuitively because if you think about the individual contributor in a company, they want to have some recognition for their work. If you have a project which is created by some original person, that person immediately gets recognition because they are the creator. They are the person who owns this initial thing. It’s all about ownership at some level.
The thing with open source is that if someone starts a project and you come in later, it doesn’t matter if you are the 1st or 10th person. If you make a significant contribution, that’s going to be recognized. People can see that. If you come into a company and they have this legacy tool, and it’s not open source, you are like, “This doesn’t get me any credibility if this tool is already not liked. I’m not going to fundamentally make this tool something that people enjoy using.” I could hurt my credibility by continuing to work on something like that, which also leads to this problem you see in companies of constantly creating a new version of a project that already exists, which I’m sure Maxim can tell you happens all the time.
I have never heard it put like that. In your annual review, you are not going to get any credit for, “I maintained this five-year-old thing that we built and still use, but it had a bunch of security holes.” I would imagine that’s not going to go down too well. In terms of Temporal, are you able to explain to folks like myself, who don’t have experience dealing with massive workloads or data workloads, what problems it solves?
Here’s one thing I want to mention right away. It works for massive workloads, but it also works for very small workloads. We don’t want people to get the impression that this is only useful for the big guys. A lot of small companies and startups are using us for small use cases, and then you can grow into large ones without changing your code, but that’s not required.
Temporal is about guarantees. Think of it this way. Most people are used to databases and having transactions. The nice thing about transactions is that you write code as if most of the problems don’t exist. Your transaction either commits or it doesn’t, and all the intermediate states and inconsistencies are handled by the database.
We are moving to a world with microservices, distributed systems, and so on. We don’t have transactions across microservices for obvious reasons. People have tried multiple times, but I don’t think anyone uses that in production on a large scale. How are we going to solve this problem? Now, you don’t have consistency, but if you do money transfers or any business process, you want to have consistency at some point. Temporal is one way to solve that. It provides you a code-based solution, so it’s for developers. You write code, and if we simplify it to the bare minimum, you write a function. Temporal guarantees that this function executes exactly once, no matter what happens.
Think about it this way. If you have a function that is guaranteed to execute in the presence of any failures, you get a lot of interesting properties. For example, this function can run for a very long time. You can say this function runs for months, and it is guaranteed to execute because it’s not tied to a specific process. You can do deployments and rebalance processes, but the function will keep going, and all the variable state of that function is always preserved. It’s not only the duration of the function but also the duration of any API call. If you make an API call that takes five days, you still block on that line of code and wait. We are used to programming things that take milliseconds, but here you can write the code the same way as if it took milliseconds, even if it takes a month.
This is the biggest difference. You can practically have something where you say, “Do that,” and it can take a day. Maybe the downstream system is down for the day, but from the caller’s point of view, there’s no difference. It is retried on the backend, but as the caller, you don’t care. You can then say sleep for months: you call sleep, you block on that line of code for months, and then you go to the next line of code months later. What can you do with that? Let’s say you want to manage subscriptions for your customers. You want to bill them once a month. How would you do that in this system? You would write a loop. The syntax depends on the programming language. You can use whatever you like. We support a lot of languages.
Let’s say in Java, you write a for or a while loop, and then you say sleep one month, then charge, then send an email. There’s also what happens if it fails to charge for business reasons, for example a problem with the credit card: it cancels the subscription. That can be twenty lines of code. Think about it. If you have 100 million customers and want to do that at scale, and you want the system to be fault tolerant, how would you do it without a system such as Temporal?
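The subscription loop described above can be pictured with a minimal sketch. It is written in plain Python rather than Java, and it does not use the Temporal SDK: `Billing`, `ChargeDeclined`, `subscription`, and the injected `sleep` callable are hypothetical names introduced only for illustration. The point is the shape of the code, a loop that sleeps, charges, and emails, not how a durable runtime keeps it alive.

```python
from dataclasses import dataclass, field
from typing import Callable, List

ONE_MONTH_SECONDS = 30 * 24 * 60 * 60  # simplification: treat a month as 30 days


class ChargeDeclined(Exception):
    """Business-level failure, e.g. the card was declined (hypothetical)."""


@dataclass
class Billing:
    """Hypothetical stand-in for payment and email services; it only records calls."""
    declined_on_cycle: int = -1          # which billing cycle should fail, -1 for never
    log: List[str] = field(default_factory=list)
    cycle: int = 0

    def charge(self, customer: str) -> None:
        self.cycle += 1
        if self.cycle == self.declined_on_cycle:
            raise ChargeDeclined(customer)
        self.log.append(f"charged {customer}")

    def send_email(self, customer: str, subject: str) -> None:
        self.log.append(f"emailed {customer}: {subject}")


def subscription(customer: str, billing: Billing,
                 sleep: Callable[[int], None], max_cycles: int) -> str:
    """One customer's subscription, written as straight-line code."""
    for _ in range(max_cycles):
        sleep(ONE_MONTH_SECONDS)  # on a durable runtime this line could block for a month
        try:
            billing.charge(customer)
        except ChargeDeclined:
            # Business-level failure: handled in application logic.
            billing.send_email(customer, "subscription cancelled")
            return "cancelled"
        billing.send_email(customer, "receipt")
    return "completed"


# Usage: inject a fake sleep that only records the requested durations.
slept: List[int] = []
billing = Billing(declined_on_cycle=3)
outcome = subscription("alice", billing, slept.append, max_cycles=12)
```

Run as an ordinary process, this loop would lose its place on the first restart. The claim in the interview is that a durable execution runtime lets code of this shape survive restarts and deployments.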
Maybe I can anchor it from the developer side specifically in a practical example. Take something basic. It’s not necessarily something that you’d do for a real business, but it could be. You have some router, some endpoint that you are hitting that results in some function you defined running. All that function does is make 3 different calls to 3 different dependencies. One could be your own database. One could be Stripe or Twilio, for example.
Whatever your business process is, imagine that the only way that it’s in a valid state is if all those things happen or none of them happen. If there’s a state where you go through two of them, and you call Twilio and Stripe, and then the thing crashes, and you didn’t call the database, that’s invalid. You have a problem at the business level with that.
As a developer, when you start trying to define a piece of logic like this, even thinking about those individual calls, every time you are making a call, you are like, “Is this thing going to fail or not?” If you are thinking about that, you are like, “I probably have to handle the case where it could fail.” You go through, and you have to add all this extra crap to your code. You start thinking, “What happens when I call this function?” It gets 2/3 of the way through. It crashes, and then someone calls it again, and you are like, “When I started this function, I don’t know what happened before.” You then start building extra logic to make sure that you know where you are.
When you are a developer trying to develop something like that, you have so much uncertainty about what’s going to happen. It’s not clear. Even if you are like, “I want to know if there’s an error. I want to know very explicitly,” it’s not given to you that way. You have to dig and understand if there was a broken state that you entered into in the first place.
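To make the "extra logic" concrete, here is a minimal sketch of the bookkeeping a developer ends up writing by hand when three dependency calls must all happen. Everything in it is an assumption for illustration: `run_with_manual_checkpoints`, the in-memory `store`, and the three fake calls stand in for a real database, Stripe, and Twilio.

```python
from typing import Callable, Dict, List, Set, Tuple

Step = Tuple[str, Callable[[], None]]


def run_with_manual_checkpoints(request_id: str, steps: List[Step],
                                done_store: Dict[str, Set[str]]) -> None:
    """Run steps in order, skipping the ones already recorded as done."""
    done = done_store.setdefault(request_id, set())
    for name, step in steps:
        if name in done:
            continue          # a previous attempt already completed this step
        step()                # may raise; a failed step is not recorded
        done.add(name)        # a crash between step() and this line would repeat the step


# Fake dependencies that only record that they were called.
calls: List[str] = []
twilio_attempts = {"count": 0}


def charge_stripe() -> None:
    calls.append("stripe")


def send_twilio() -> None:
    twilio_attempts["count"] += 1
    if twilio_attempts["count"] == 1:
        raise ConnectionError("simulated outage on the first attempt")
    calls.append("twilio")


def write_database() -> None:
    calls.append("db")


steps: List[Step] = [("stripe", charge_stripe), ("twilio", send_twilio), ("db", write_database)]
store: Dict[str, Set[str]] = {}

try:
    run_with_manual_checkpoints("req-1", steps, store)   # fails partway through
except ConnectionError:
    pass
run_with_manual_checkpoints("req-1", steps, store)       # resumes without charging twice
```

Even this small version leaves a gap, noted in the comment: a crash after a step runs but before it is recorded repeats the step, so every dependency call also has to be safe to repeat.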
Temporal deletes that entire equation. You do not have to deal with any of those problems. When you start developing that code, you can be 100% certain that what you are coding is going to happen, and there are no in-between cases where you can get screwed over because you didn’t think it through fully enough.
I have come across these issues from time to time where your data is in a state and you are like, “How can it ever get into that state?” You don’t know why, and you can’t figure out why. Maybe half the time, what you’ve described is the way that it’s getting into that state. You’ve assumed that when you hit a Stripe endpoint, it’s going to return a 200, but maybe it doesn’t once in five million calls or something. What was the genesis of that idea? Was there a single moment or problem you were working on where you were like, “What we are doing is nuts?” It’s not like there’s a famous paper that popularized that approach.
It has a very long history. I worked for eight and a half years at Amazon, and Amazon was one of the first companies to move to a service-oriented architecture. They invented that. They started to do it at the end of the ’90s. I was on the team which owned the publish-subscribe subsystem of Amazon. I helped design distributed storage for queueing systems. Simple Queue Service (SQS) still uses that storage. I was the tech lead for the whole pub/sub of Amazon back then. Services communicate through pub/sub, and this is what most people do now.
We quickly realized that for systems as complex and as big as Amazon, pub/sub has a lot of limitations. You have services that have their own databases communicating through pub/sub. You can do it, but you end up in a mess. There are many issues with that, like visibility, troubleshooting, understanding what’s going on, or scaling the system.
There are so many things, mostly consistency. Especially if something goes wrong, these queue-based systems are extremely hard to figure out, or even to write code for, because there is no centralized logic. Everything is distributed. We realized that, and then Amazon started to build what we called workflow systems to orchestrate all these calls. Out of this, there were multiple iterations internally. Publicly, Amazon released the AWS Simple Workflow Service.
I was the tech lead for the public release of AWS Simple Workflow. We iterated not only on the backend engine, which is highly scalable, but also on the programming model. We did multiple iterations. The first iteration was practically at a very low level. I built something which was more like a DSL, where you practically create, in memory, an abstract syntax tree of, “This is a sequence of activity calls. This is parallel execution. This is if and switch statements.” You need to compose these graphs of things in memory, and then you can execute them. I found that this is cumbersome. It’s hard to program. This is the most interesting and still best.
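The graph-building style described here can be sketched in a few lines. This is not the Simple Workflow API: `Activity`, `Sequence`, `If`, and `execute` are hypothetical names, and parallel execution is left out to keep the sketch short. The second half shows the same logic as ordinary code, which is the direction the later iterations took.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Activity:
    name: str


@dataclass
class Sequence:
    children: List["Node"]


@dataclass
class If:
    condition: Callable[[], bool]
    then: "Node"
    otherwise: "Node"


Node = Union[Activity, Sequence, If]


def execute(node: Node, trace: List[str]) -> None:
    """Walk the in-memory tree and record which activities would run."""
    if isinstance(node, Activity):
        trace.append(node.name)
    elif isinstance(node, Sequence):
        for child in node.children:
            execute(child, trace)
    elif isinstance(node, If):
        execute(node.then if node.condition() else node.otherwise, trace)
    else:
        raise TypeError(f"unknown node type: {node!r}")


in_stock = True

# DSL style: the program is data that has to be assembled before it can run.
graph = Sequence([
    Activity("reserve"),
    If(lambda: in_stock, Activity("charge"), Activity("notify_out_of_stock")),
    Activity("ship"),
])
dsl_trace: List[str] = []
execute(graph, dsl_trace)


# Code style: the same logic written directly in the host language.
def order(trace: List[str]) -> None:
    trace.append("reserve")
    if in_stock:
        trace.append("charge")
    else:
        trace.append("notify_out_of_stock")
    trace.append("ship")


code_trace: List[str] = []
order(code_trace)
```

Both versions produce the same trace. The difference is that the second one gets variables, loops, and error handling from the language for free, while the first has to reinvent each of them as a node type.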
We realized that we could do it in pure code. That is the revelation we had at Simple Workflow. Samar, my Cofounder, moved to Microsoft, and then he built a similar framework, which is called the Durable Task Framework. Later, the Azure Functions team adopted that as Durable Functions, which uses the same idea but for .NET. Later, both of us joined Uber, and then we started the Cadence project, which was exactly that idea.
It took us ten years or so and multiple iterations to get there. Temporal continued the Cadence project. It’s a fork of the Cadence project. It’s also open source, and so on. It’s multiple iterations. There were 4 or 5 types of frameworks for the programming model. What we have now is probably the fifth iteration of the same ideas.
Can we dig a bit deeper into the platform itself? What are the core ideas around how it does go about solving that problem? It sounds like a pretty hard problem to solve off the bat.
I don’t know what level is best to address it at, but the core idea behind it is that there’s a level of checkpointing that goes on. The interesting thing about this type of checkpointing, compared to the traditional type, which is usually based on a memory snapshot or a snapshot of the actual log itself, is that it is more similar to something like git commits or git history. The way that you traditionally write code is that you have a bunch of pure code, let’s call it that. In a standard programming language, it’s not making any calls to IO, so there are no writes to the file system. There are no network calls. You are not doing anything like that.
There are two different types of code you are writing. You are writing code that does IO, and you are writing code that doesn’t do IO. It’s pure loops, logic, and stuff like that. The key innovation that Max stumbled upon, and that initially ended up in Simple Workflow, is the idea that you can explicitly define APIs for doing stuff that requires IO, like calling the network or doing anything that’s inherently non-deterministic, so you can’t predict that it’s going to happen that way again. If you can separate that code out, then the rest of the code, if you run it a thousand times, a million times, or two trillion times, it doesn’t matter because it’s always going to do the exact same thing.
You don’t need to checkpoint the entire state of the memory of the thing, or any other level of granularity, other than saving the inputs and the results of the things that require IO. That’s the basic premise of Temporal. You write your code, and you explicitly put the things that do IO into these concepts that we call activities, which are basically functions.
Temporal knows when it sees an activity or one of a few other special-case primitives that we support. That means that this is something stateful. You can’t reproduce it by running it again. Therefore, you want to send that to the Temporal service in a strongly consistent way and make sure that it’s persisted there before you move any further with the actual code execution itself. There’s this dance back and forth between the Temporal client, the actual application server, which is running your code, usually in the user’s environment, and the Temporal server itself, which is doing the stateful checkpointing and has a database and all of that good stuff.
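The checkpointing idea in the last few answers can be illustrated with a toy replay loop. This is a sketch under stated assumptions, not how the Temporal SDKs are implemented: `Context`, `Crash`, `workflow`, and the two activities are hypothetical names, and the `history` list stands in for the record that the Temporal server would persist. Results of activities are saved, and on a re-run the deterministic code is executed again from the top while saved results are handed back instead of repeating the IO.

```python
import random
from typing import Any, Callable, List, Optional


class Crash(Exception):
    """Simulates the worker process dying in the middle of a workflow."""


class Context:
    """Hands activity results to workflow code, from history when available."""

    def __init__(self, history: List[Any], crash_at: Optional[int] = None):
        self.history = history      # stand-in for the server-side persisted record
        self.crash_at = crash_at    # simulate a crash before this activity index
        self.position = 0           # number of activity calls made in this run

    def activity(self, fn: Callable[..., Any], *args: Any) -> Any:
        if self.position < len(self.history):
            result = self.history[self.position]   # replay: do not run fn again
        else:
            if self.position == self.crash_at:
                raise Crash(f"simulated crash before activity #{self.position}")
            result = fn(*args)                     # first time: do the real work
            self.history.append(result)            # record before moving on
        self.position += 1
        return result


executions: List[str] = []


def roll_dice() -> int:
    """Non-deterministic activity: running it twice could give different results."""
    executions.append("roll_dice")
    return random.randint(1, 6)


def double(n: int) -> int:
    """Stand-in for a call to an external service."""
    executions.append("double")
    return 2 * n


def workflow(ctx: Context) -> int:
    first = ctx.activity(roll_dice)
    second = ctx.activity(double, first)
    return first + second            # pure logic, safe to re-run any number of times


history: List[Any] = []
try:
    workflow(Context(history, crash_at=1))   # first run dies after one activity
except Crash:
    pass
result = workflow(Context(history))          # second run replays, then continues
```

After the second run, each activity has executed exactly once even though the workflow function itself ran twice, which is why the pure part of the code has to be deterministic.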
It’s inserting itself at the transaction begin and commit for anything that is non-deterministic. More often than not, it’s connecting to a data store or an API.
Writing to a file system and then random number generation. How many more are there, Max? Probably not that many.
You’ve written some SDKs for a bunch of popular languages that then make those calls to Temporal. The Temporal system can tell you if you need to roll back state in the event of partial failures and things of that nature.
We have two types of failures. Think about it this way. There is business logic, your application. For example, you are saying, “I’m doing a money transfer.” I need to call one bank, withdraw my money, and deposit it in another bank or another service. For example, the withdrawal happens, but the deposit cannot happen because the user closed that account. It’s a business-level failure. It’s not infrastructure level. You have to handle that in your application logic.
In this case, you probably run some compensation logic to put the money back and undo the original withdrawal. That is something you have to write as a developer. What Temporal helps you to do is avoid infrastructure-level problems. For example, what if your process crashes in the middle of a deposit?
You did a withdrawal, and then your process crashed. Temporal ensures that this process can make a quick recovery and continue execution no matter what. It means that in your code, you don’t have to account for that. One way to describe it, which people don’t like that much, but it represents this correctly, is that it’s fault-tolerant programming. You write a program as if there are no faults. This process cannot fail. It’s a fault-tolerant process. We eliminate this whole class of problems, where things are down intermittently. For business-level problems, you still have to deal with them because it’s your business logic. You return an error or run some compensation flow, for example.
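The business-level half of that split, the part the developer still writes, might look like the following minimal sketch. `Bank`, `AccountClosed`, and `transfer` are hypothetical names for illustration, and the in-memory balances stand in for two real banking services. Infrastructure failures such as a process crash between the two calls are deliberately not handled here, since that is the class of problem the interview says the runtime takes care of.

```python
from typing import Dict, Iterable


class AccountClosed(Exception):
    """Business-level failure: the target account no longer accepts deposits."""


class Bank:
    """Hypothetical in-memory stand-in for the banking services."""

    def __init__(self, balances: Dict[str, int], closed: Iterable[str] = ()):
        self.balances = dict(balances)
        self.closed = set(closed)

    def withdraw(self, account: str, amount: int) -> None:
        if self.balances[account] < amount:
            raise ValueError("insufficient funds")
        self.balances[account] -= amount

    def deposit(self, account: str, amount: int) -> None:
        if account in self.closed:
            raise AccountClosed(account)
        self.balances[account] += amount


def transfer(bank: Bank, source: str, target: str, amount: int) -> str:
    bank.withdraw(source, amount)
    try:
        bank.deposit(target, amount)
    except AccountClosed:
        # Compensation: undo the withdrawal that already happened.
        bank.deposit(source, amount)
        return "refunded"
    return "transferred"


ok_bank = Bank({"alice": 100, "bob": 0})
ok_outcome = transfer(ok_bank, "alice", "bob", 30)

closed_bank = Bank({"alice": 100, "bob": 0}, closed={"bob"})
closed_outcome = transfer(closed_bank, "alice", "bob", 30)
```

The compensation branch is ordinary application code. What the sketch cannot show is the guarantee that, once the withdrawal has happened, either the deposit or the refund will eventually run even if the process dies in between.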
Would it be right to say that implementing Temporal would then reduce the need for draining connections or an upgrade to a microservice because you don’t have to worry about those things being interrupted?
There’s activity code, for example. If it’s idempotent, we can interrupt it and retry. If it’s not, then you probably need to have a different activity to run compensation logic. Workflow code does orchestration and calls into those activities. That code can be interrupted at any time. It’s no problem. We can immediately continue the execution.
We don’t run the code. We don’t run either activity or workflow code. It’s the same way as in queuing, where your consumers and producers run outside the cluster. It’s the same with Temporal. You run your code, which contains all your application logic. It connects to the backend Temporal service through the gRPC interface. These are what we call worker processes. You can kill them any time. You can restart them, do deployments, upgrades, and so on. That makes your life much easier because we practically let stateless processes run stateful applications.
What came first, the code project, the GitHub project, or the business itself? What was the genesis of that type of thing? Was it like, “This will be a cool thing to hack on. I’m going to put something on GitHub and see what happens?” Or was it go down Sand Hill Road, knock on a few doors, raise some money, and then start work?
We had another open source project called Cherami. It was a messaging system, a pub/sub system. I joined Uber to build that system because I had built a few pub/sub systems in my life. It was a relatively successful project within Uber. It was open source. At some point, from a business point of view, it didn’t make sense for Uber to invest much in it because Kafka became much better. When we started, Kafka wasn’t rock-solid enough to run infrastructure at the Uber level. By the time this project was almost feature complete, Kafka was able to do that, so it didn’t make business sense to keep investing in it. It’s still out there, but I don’t think anyone uses it.
For that project, because we tried to build a very large-scale system, we needed to do background scans. We needed to make sure that the storage level is scanned periodically, that we do repairs, and all these things. We were like, “How do we do the background scans?” “We built Simple Workflow at Amazon,” me and Samar, my Cofounder.
We tried to do the same thing here. We built a simple open source version based on the ideas of Simple Workflow. It has a different implementation, but the ideas are very similar. We then moved from that project to getting customers for it. We found internal users and internal customers. It didn’t pass the principles like the Uber usage.
They started to use it, and within a year, a dozen teams were using it. In three years, we had a hundred use cases running on it. It grew organically. The management realized the value of that and funded the project. Uber was pretty good at that. It wasn’t a top-down initiative. We built it on our own. They recognized the value of it. They didn’t start to doubt, and they funded it.
There was a lot of external adoption, which probably made Uber engineering look pretty good at the time. I’m assuming.
The first few years weren’t much, but in the third year, it started to get a pretty decent external adoption. It certainly helped as well. They still solved a lot of problems for themselves. It was a pretty good investment for them.
Going back to the GitHub project itself, when did you first push that into a public repository?
For the first project, Cherami, we built it internally, and then we got permission to open source it. As we had built it internally, it took us almost half a year to get to the point where it could be externalized. There are a lot of internal dependencies to strip out. You have to keep those separate. When we started the Cadence project, which is pretty much the same as Temporal, we started to build it in the open right away. After a couple of weeks of hacking, everything went to the public GitHub repo, and we built the whole thing in the open from the beginning. We had some internal integrations, which were proprietary inside Uber, but it was built in a way that made these extensions possible from the beginning.
It’s not like, “We built this project, then we released it.” No. It was built from the beginning as an open source project. We didn’t get much adoption for the first two years, even though it was a production workflow system. At some point, we started to get adoption. I periodically hear from people in open source. They say, “I want to start a company. I want to do an open source project. What was the process?” Don’t expect fast adoption, even if it’s useful. It took us two years.
There are some projects that get immediate adoption, but for us, it took some time. Ten years was the iteration on the ideas, but for this specific project, even after it was practically ready to be used, it still took us a couple of years. The cool news is who adopted it. We worked with very serious companies, like HashiCorp, for example. They were one of the first users.
DoorDash and Checkr.
The logos on your homepage are insane.
We are proud of them. We have some awesome partners that we work with.
It reflects that we are solving a real problem. It’s very easy to put up a logo if you have some minor use case somewhere, some team in a 30,000-person company using it. A lot of these companies use us for core use cases. We are not talking about, “They do some data pipelines.” We are talking about core business flows. There are a lot of very serious companies using this technology in their core path. Coinbase uses it to do money transfers. This is the core of their business. There are a lot of examples like that.
Not one of the logos on the front page doesn’t use this for a critical path of their business, every single one of them.
In terms of the commercial side of the business, just to give a bit of an overview, how do you guys cut that cake in terms of completely free and building a sustainable business? Where do the chips fall in terms of what you should charge for? How does that work?
The thing that is interesting about Temporal and its predecessor Cadence is that, on day one, when Temporal forked into its own project from Cadence, it was already what a lot of our users would have defined as feature complete. It had features on day one that most open source products never have. It has multi-region distribution and failover and stuff like that, which is something that you almost never see with an open source project. From day one, we had less of that problem of what we are going to keep in the open versus closed because the product was already robust. It was already something that people were very happy with in general.
The other thing that helped us there is that when I joined, one of the things that we were still not 100% decided on was the exact business model. For an open source company, revenue and monetization is always the thing that is up for debate, “How do you do it? What is the right way? What is the wrong way?”
We had been on the cusp of this thing where people had started going, “You don’t need to worry about these legacy business models, on-prem or enterprise services or anything like that. You can go all-in on the cloud. It’s the first time ever that’s been possible.” What we felt is that cloud was not only going to be the best thing in terms of our business model and what would maximize our revenue but also what would allow us to be the fairest to the open source project and to the things that were completely in the open. They are not opposing each other as much as they would be if you had some closed-core model or the other things that people often turn to when they need to make more money as a business.
For us, we have been very fortunate in the sense that people shouldn’t be running infrastructure. They don’t want to run infrastructure. People are happy to take that off of their plates. We don’t need to gatekeep that much of the UX and good developer experience from the open source because people are willing to pay us to do what we do best, which is run this technology that we developed at scale. That’s at least my two cents.
How does that work in terms of network latency?
Networks are not that bad between cloud providers. In general, think about it. We are not an application for real-time trading. We are not going to do microsecond latencies but something which is in the tens of milliseconds. The network is very good these days. We record almost everything that you are doing, for safety reasons and troubleshooting reasons. What it means is that the performance of our applications is limited by database latency. It’s almost always higher than any network latency you can get.
It’s important to say we are not a big data platform. We allow you to do business processes. For big data, we are very frequently used by a lot of companies, but to build control planes for big things. We are not in the data path of big data, but we are the control plane for it if you have to orchestrate different parts, like infrastructure and all of that.
For big data, it’s very common. A bunch of startups, and not only startups, build ML and data platforms around us. We are not in the big data path, which means that there is no problem running your application in one data center, cloud provider, or on-prem and still connecting to the cloud, for example to our SaaS service, which we expose, and still getting reasonable performance. As we are not big data, there is not that much data crossing the boundary, so it’s not expensive in terms of data transfers.
Ryland, in terms of your role, it’s a very developer-focused product. Where does your remit lie in that regard? Is it around designing those API interfaces or the developer experience? How does that sit?
One thing to keep in mind is that I joined Temporal very early. When you join a company early, especially in the type of role that I did, you end up doing everything. My focus was much different than it is now. It was very much around building up the business. I started a lot of the teams and functions in the company, basically all the product-side stuff. That was the focus for the first year and a half. We already had a product that people were happy with, but there was this unsolved question of, “How do we get it out to them?” and more of the distribution and the operational side of things.
My focus now is everything that is developer-related: anything that touches developers, open source, and what we call platform, which is future innovations in the product. I spend a lot of time trying to understand our users. One of the interesting things about Temporal is that we have very idiomatic language SDKs. It’s not like a lot of other SDKs for products, where each one is a wrapper around an API. You don’t need to be an expert to know how to use the Go SDK. Our Go SDK is very intuitive to a Go developer, and we put all of our efforts into making sure that’s true for all of our language experiences.
That means that you can’t do this broad-strokes data collection and surveying of people, because each one could have a fundamentally different experience based on how they are using the product. I spend more time than anything talking to our users. I’m talking to the developers who are building things with Temporal or trying to adopt Temporal, and understanding the pain points and the big barriers to entry, not only for you but for the people you try to introduce it to. One of the things we tend to find is that if someone champions Temporal into their company, it is usually a platform engineer, and they may be very experienced with distributed systems and concurrency and all of this.
The goal is to get it in the hands of hundreds of other engineers that aren’t maybe as capable and haven’t spent as much time with distributed systems, infrastructure, and all of that. Understanding how we can make those people as successful as the first person who brings it into the company is one of the big problems we are constantly trying to solve at Temporal. That’s a huge part of it. The other thing is based on the complexity and how broad the product is. It is also hard to understand from users what top-level fundamental new types of features they would be interested in and would help them.
A huge part of that is because it’s hard to almost imagine when you are looking at Temporal unless you see the whole thing. It’s hard to imagine something that could be cross-cutting and provide value for the actual product itself. We spend a lot more time trying to figure out, “Are there missing primitives within the system? Are there fundamental things that we could add that would unlock an entirely new set of users or use cases, or a way better developer experience for people?” That’s something which has been very interesting. We will eventually make a lot of progress and have success in that direction.
That sounds challenging. I have always found the discipline to say no to something is the thing that defines you. It sounds like you have loads of things come up where it would be like, “That would be cool if we added this.” All of a sudden, you are like, “Is that more of a distraction than the actual product?” Does that happen a lot?
All the time. That happens for many different reasons. You have the ones where people rightfully see the world from their own point of view. They work in a company that depends on specific technologies or libraries. From their point of view, it’s like, “Why isn’t Temporal integrated with this thing?” There are a million other things like that, and everyone asks for a different flavor of it. Temporal is this very clean core that we want everyone to be able to benefit from, and adding all of those is how you get bloat in software. That’s how you build a product that doesn’t serve any specific user or need.
That’s one layer of it. The next one is that people think of a good idea, but the problem is that Temporal is based on so many guarantees, contracts, and assumptions about the way it executes things. An idea can sound good and seem to have a good user experience, but when you play it out, especially with a real use case, it starts falling apart.
This is one of the strongest things that has stuck with me, and maybe with anyone else at Temporal who works with Max. That’s his number one rule: “Don’t do something if it’s not going to scale.” If there is one type of feature request we get that doesn’t happen, that’s probably the biggest reason why. It’s a great idea, but it doesn’t scale, or it falls apart if you want to have strong consistency or whatever it is that Temporal needs to defend. It’s the hardest one to explain to users as well because, a lot of times, it’s pretty complex and nuanced.
It’s the approach. If you look at a lot of open source projects, not all of them, but a lot of them come from people solving their own problem. They had a problem, and they solved it the fastest way they could. They put it out there and it becomes successful, and then somebody tries to run it for high-scale use cases, and usually it doesn’t work because the implementation made assumptions about scale. They just wanted to solve a problem. For example, you have a single database or maybe even a single process that has to process things. Sharding or partitioning the database might be impossible because a feature is not partitionable, and things like that. We came from the completely opposite direction.
We modeled the whole thing as a public database service. We built it the way you would build a cloud service, because my cofounder and I have been building cloud services and keep building them. We did it at Microsoft and at Amazon. It was in mind from the beginning. The catch is that it also means we are not going to add any feature that we can’t run at, for example, 100 million or 1 billion workflows, even if it looks interesting. If you look at successful open source projects that have the same mindset, they came from big companies. For example, if you look at Kubernetes, it was built by people who worked at Google.
These guys ran this extremely high-scale infrastructure for years and had seen a lot of these problems, and then they went on to build Kubernetes. It wasn’t scalable from the beginning, but at least the abstractions were. It’s the same for Hadoop. Hadoop was built because there was a set of Google papers explaining how to build this thing correctly, like the Google File System, MapReduce, and so on. They followed this design, and that is how it became scalable.
If somebody had tried to build Hadoop without those papers, they would have needed a lot of iteration, because Google did a lot of iterations with those technologies before releasing these papers. As we had previous experience, we were able to do it right. Not because we are so smart. It’s that we did it multiple times, so we knew what we had to do.
Are you guys standing alone in this segment? One of the things that would be challenging for you is for folks like me to know that a solution to this class of problem exists. It isn’t immediately obvious.
It’s true. To directly answer your first question, the two closest truly similar technologies, not all that surprisingly, are the ones that Max and Samar built for other companies, which are still available as solutions. Those are the closest things that exist. There are some others that have come out from similar angles, but nothing that provides as comprehensive or scalable a model as what Temporal provides, at least. That’s a huge part of it.
For the second part of that question, you are exactly spot on. We see a lot of success with the type of user who has had to try to solve these problems before and has almost always gone and built their own solution first and realized why that doesn’t work. Our success rate with people who haven’t tried building their own version of Temporal first is much lower than with the ones who have, because those people have realized why it’s probably not a good decision to go and do that again. Our biggest competitor is ad-hoc solutions that people build themselves. That is, without a doubt, what we hear probably 7 out of 10 times.
It’s important to realize that in most cases these ad-hoc solutions are not a Temporal-like “Let’s build an engine.” They are usually application-specific. If you need to do infrastructure provisioning in your application, or you need to provision your own cloud, all you will do is write your script, maybe around Terraform or whatever. We link these things together in an ad-hoc manner. In the same way, if you’ve got a business process, you’ll probably have a bunch of components talking to services and databases, and maybe some other type of service. You can always compose those things from these low-level components. The problem is that every time you compose that, it’s not reusable.
When you have another application, you compose that again. And I’m not even talking about the operations, metrics, and all these other things, which are very hard to get right even if you operate for years. This is exactly where people get the most value from us: they can practically replace this ad-hoc code. An example would be the time before databases existed. People were writing files and doing it manually, and then databases appeared. If you tried to build a business application today and said, “No. I’m going to write files myself,” people would think you are insane. After people understand what Temporal is, they arrive at the same conclusion. If you are building it yourself, it’s not right because you are practically redoing something.
It’s not like we are smart. We have just been doing it for a long time. Also, Temporal itself already has many years of work behind it and five years of high-scale production use across hundreds of companies. Try to say, “I can do that myself.” It is impossible. Even if you had a very dedicated team, you would need 5 to 10 years to get there.
That’s part of the beauty of Temporal. There are so many times in a call that you’ll hear some users say like, “Why can’t I do it this way?” Max is like, “I tried all the ways.” It wasn’t like we had arrived at something new.
You’ve got a class of people who have built that ball of string and then realized how hard it is, and then you’ve probably got a class of people who don’t realize how hard it is. It’s a big challenge. Even a monolithic application naturally becomes distributed because it’s relying on third-party services, which are non-deterministic.
Most developers accidentally become distributed systems developers these days.
One of our goals is also challenging. We have a TypeScript SDK. We want every TypeScript developer who uses Node to be able to write these types of systems. My understanding is that it’s very hard, especially in a scalable manner. You can build your startup using TypeScript, do all your business flows using that, and then ask, “If my startup scales 10,000X or 100,000X, will I need to rewrite my systems?”
In Temporal, if you build it right, you don’t need to change your code. It will just scale if you designed it right. It also makes your life easier when you build. Normally, if you want to build something for scale, it takes time and resources, and you don’t want to do that. In Temporal, it is faster to build from the beginning. You also get the scale, the operations, and all these other benefits, because we have covered the whole life cycle. It’s not only about how you build things. It’s how you operate them at scale.
Are there characteristics of programming languages that make your SDKs more comfortable to work with? Are there some languages that you know are going to be a real pain to build an SDK for?
Some languages support async natively, for example .NET and TypeScript. Those make life much simpler, but when we support languages like Java, we still provide you with the same workflow semantics. You can practically call something that takes a day, a week, or a month, and it still works. It’s a little bit harder to implement on our side. We also need users to use certain APIs. For example, with threads, you explicitly use our APIs to create threads. Otherwise, you still get a very native experience. We are doing the same thing in Python. We are looking at a Python SDK.
I presume the SDK is non-trivial.
It was historical, because Uber as a company had two languages, Go and Java. Go was much more widely used. All the services there are built in Java and Go. The first SDK was the Go SDK. We then built the Java SDK. After we left Uber, at Temporal, the PHP SDK was contributed. It’s more historical, because this company in Belarus is a very strong group. They are a PHP shop. They wanted to do it for their clients. They contributed that. It was a pretty significant effort, but they did that and still help us to maintain it.
They have a popular open source project called RoadRunner. It’s a PHP routing layer. They would be interesting for you to have on your show.
There’s one special answer there. When I joined the company, one of the things that Max was immediately trying to sell me on was this: there’s a requirement to understand determinism to use Temporal. His idea was, “What if there’s a way to avoid that requirement altogether?” People could write code and not think about whether it was deterministic, and we would make sure that it’s done correctly. His idea was around WebAssembly, because it compiles to deterministic primitives regardless of the underlying language. You can guarantee that no matter what language someone writes in, it’s going to be deterministic.
We hired someone, one of my friends who’s an excellent engineer. He knows this stuff, so he was able to make that happen. The TypeScript SDK is unique because you cannot write non-deterministic code. It does not let you do that because it’s run in a deterministic virtual machine every single time it’s executed.
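As a rough illustration of why determinism matters for this kind of replay-based execution, here is a minimal Python sketch. It is not Temporal’s implementation or API: `ReplayContext`, `side_effect`, and `workflow` are hypothetical names invented for this example. The assumption is that workflow code may be re-executed from a recorded history, so any non-deterministic value has to be recorded the first time and read back on replay.

```python
import random


class ReplayContext:
    """Hypothetical helper, not a Temporal API.

    On the first run it records the result of each non-deterministic call.
    On replay it returns the recorded results in the same order.
    """

    def __init__(self, history=None):
        self.history = list(history) if history else []
        self._cursor = 0

    def side_effect(self, fn):
        if self._cursor < len(self.history):
            # Replay: reuse the value recorded during the original run.
            value = self.history[self._cursor]
        else:
            # First execution: run the function and record its result.
            value = fn()
            self.history.append(value)
        self._cursor += 1
        return value


def workflow(ctx):
    # All randomness goes through the context, so a replay sees
    # exactly the values the first run saw.
    discount = ctx.side_effect(lambda: random.randint(1, 100))
    order_id = ctx.side_effect(lambda: random.randint(1000, 9999))
    return {"order_id": order_id, "discount": discount}


first = ReplayContext()
result_first = workflow(first)

# Simulate a restart: re-run the same code against the recorded history.
replayed = ReplayContext(history=first.history)
result_replayed = workflow(replayed)
```

Calling `random.randint` directly inside `workflow` would instead produce different values on replay, which is the kind of bug a deterministic sandbox is meant to rule out.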
The other thing we did, because our SDKs are very fat and contain complex state machines and logic, is that we built a Core SDK in Rust. It’s about 80% of the complexity of an SDK. The new languages like TypeScript, Python, and .NET are built on top of that library. It’s much easier to add new languages now. Otherwise, we would be rebuilding that 80% of the code each time, and it took a strong team almost a year to build that Core.
There is a lot of complexity in there. It’s not a small project. Using it, we can build an SDK much faster. We have SDKs in the works as well. In the future, especially as we improve documentation there, I can clearly see the community contributing, or even adding languages directly. I want us to be there in almost every language out there.
Why did you choose Rust over Go for the Core?
This is a library. You have Python calling into it. Go is not a good language for a shared library. I love Go. For the backend system, it is awesome. For the command line, we also use Go. If it’s a shared library, Go is not the right one.
Rust guarantees. You get very explicit guarantees when you are running it.
That’s why it’s Rust, not C++. Go has a runtime, and it has weight. Rust doesn’t come with a runtime. It’s a much lighter-weight library in general. In the future, we’ll see. I want to make around Temporal and phone it out there one day. Maybe Rust will be part of that.
I have grown some brain cells, which is unusual for me. I’m going to go and have something to eat, and then I’m going to kick around with your documentation and have a play around with it, because I’m super fascinated by the whole concept. I would love to dig into it deeper. Thank you for your time. I’m curious to see where you guys land.
My marketing team will kill me if I don’t mention this. We have a conference coming up in Seattle called Replay. If you are interested in Temporal or distributed systems or how to make the developer experience better, you should be there.
You might have to jump the queue ahead of the other episodes so we can get this out before then. That sounds like it would be a fairly interesting thing to attend. Thanks very much. Maybe I will have you back on the show in a couple of years, and we could do this again.
We’d love to do that.
About Maxim Fateev & Ryland Goldstein
Temporal is a microservice orchestration platform which enables developers to build scalable applications without sacrificing productivity or reliability. Temporal server executes units of application logic, Workflows, in a resilient manner that automatically handles intermittent failures, and retries failed operations.
Temporal is a mature technology, a fork of Uber’s Cadence. Temporal is being developed by Temporal Technologies, a startup by the creators of Cadence.