Talking microservices with the man who made Netflix’s cloud famous

Adrian Cockcroft is currently a technology fellow at venture capital firm Battery Ventures, but he is best known for his stint as cloud architect at Netflix. He was the company’s public face as it grew into one of the world’s foremost users of cloud computing and microservices, and developers of open source technology. He also speaks at a lot of conferences.

In this edited interview, Cockcroft discusses his path to Netflix, the lessons he has learned about scalability over the years, and why microservices are arguably more about business concerns than about architectural best practices.

SCALE: Can you give a recap of your career? I think most people know you from Netflix, and now with Battery Ventures, but you were with Sun Microsystems in the early days.

ADRIAN COCKCROFT: The foundation is I grew up in the U.K. playing around with computers in high school in the 1970s when that was a very rare thing. I decided to go and learn something to do with computers rather than do computing directly, so I did physics and electronics at university. The first job I had was at an engineering R&D company where we basically built all kinds of embedded systems, and which eventually was one of the first customers of Sun in the U.K.

After about three years of that, Sun opened an office across the street, and I persuaded them to hire me as a field systems engineer. I went from being a scraggly-looking, T-shirt-and-jeans developer, age 20-something, to having to wear a suit every day and drive around with sales reps and try to persuade other people to buy computers. A little bit of a culture shock. I did that for five or six years.

This was just as the internet was starting to happen, and we were making this transition from Sun being a workstation company to a server company. One of the first things that happened was that we got this 20-processor machine, and all the software was designed to run on maybe one or two processors, so there was a huge scalability performance problem. Then web servers came along and they didn’t scale either. We had all of that interesting stuff in the lab, so I spent several years figuring out scalability performance problems.

I eventually became a distinguished engineer at Sun for capacity planning and performance-related things, largely because I wrote a book about it in the ’90s. … Eventually, I became a high-performance computing architect for Sun for a year or two, and then Sun laid off that entire team as it shrank.

I went to eBay to sort of lick my wounds and get out of enterprise IT for a bit. Two or three years at eBay, helped form eBay Research Labs and played around. I had a lot of fun building mobile apps, and building datacenter capacity-planning models and things like that.

That was up to ’07. … It wasn’t working out for me to be in a research lab.

Netflix in ’07 … was trying to hire people who knew about scalability. That was really my specialty at the time, so I managed a team there that was building the home page for Netflix for the DVD business. Another guy, Mark White, was building the streaming home page at the time. He had recently been hired from PayPal and is still there as the VP running all of the personalization stuff.

Eventually, we shifted around a bit, since I’m not a front-end developer and I don’t really get all of that stuff still. I started doing more backend things and built the services that supported personalization. Eventually that had to move into the cloud, so we figured out how to do that. Then we formed a cloud team, and then I joined that as the overall architect in about 2009 or 2010.

Scalability, as a term, seems to have morphed in ways since your career started, where you’re talking about scaling up on these big 20-core boxes. How do you look at scalability differently today than you did when you were at Sun?

There’s the same set of problems: You’re starving some resource, which means you can’t scale. You’ve either got too much locking, or a lack of concurrency, not enough threads or something. The same basic principles and behaviors apply if you have a big machine with lots of CPUs in it, or a rack full of lots of separate machines. There’s some slight differences in the way they behave, but the basic problems and patterns look the same. It’s mostly about the tooling that changes.

I think there’s a breed of people that are performance-engineering people. You can see them scattered across the industry, and they don’t quite think the same way that normal developers do. If you scratch the surface, you usually find they have a physics degree instead of a computer science degree. That’s extremely common for people, even people who have been working in computers for a long time, that they actually have a physics degree.

Kyle Kingsbury, for example, who has been doing all of the Jepsen torture-testing of NoSQL things. He’s a theoretical physicist by training. … How stuff works, and how to decompose things and figure out how things behave is kind of the core of physics — experimental method.

Credit: Slideshare / Adrian Cockcroft

Why cloud, and how cloud?

What was the rationale in 2009 and 2010 to move Netflix into the cloud? What did you see while so many other people were still debating whether cloud was a real thing?

Netflix was growing so fast that it couldn’t build datacenters fast enough. So they would have had to get very good very quickly at building datacenters in ridiculously short time spans — and thrown a huge pile of money at it, because you have to pay money up front for building out datacenter capacity. You see that with something like Zynga, where they put in $100 million or more in one year to just go build their datacenters.

Netflix wanted to spend that money on content. Every time you have $100 million, you go, “I could build a datacenter …” — because that’s the size of datacenter that would make any sense at all for Netflix — “or I could buy another season of House of Cards. Which moves the business forward? I think we’ll spend money on House of Cards, and we’ll just keep paying for datacenters as we go.”

You pay a month after you use it with Amazon Web Services, so the long-term investments are in content, and the expenses are in delivering the content. The cost of the content itself is vastly more than the cost of delivering the content for Netflix by huge amounts. If you spend a little bit more on computers it’s not material … it’s more about when you spend money in what investments and what you’re focused on.

This came from [CEO] Reed [Hastings] downwards, and there was a large team of people. I get more credit than I should for the Netflix cloud stuff, because I was the main person out there talking about it, so I kind of became the public face of it. That wasn’t deliberate.

What was the learning process like as you went on? Eventually you started building all these internal tools, and there’s this big service-oriented architecture. I’m curious about how the cloud effort grew.

Whenever you do a transition, the way you do it is you prioritize the things that need to be understood, and you do the smallest thing that teaches you the most, and you do that over and over again. It was very much pathfinder projects — very, very surgically we’re going to try this thing out. We’re going to go as deep as we can, and learn as much as we can with the smallest amount of risk. If you structure everything that way, then you discover deep things about the way things behave, and then you come back and change your plans a bit.

For example, the very first piece of Netflix that was running in the cloud was the search auto-complete service. As you start typing in “Ratatouille” and you can’t remember how to spell it, you type “rat” and it says, “You probably meant Ratatouille.” That ran as a service, there was no graphics around it. All of the website that was supporting that was still running in the datacenter. It’s just that as you type that word in, it was sent off to a search index in the cloud.

It’s a trivial piece of technology, but it taught us everything about pushing production systems to the cloud, hooking them up to a load balancer and the tooling we needed to do it. Two or three engineers, I think, worked on getting that built in a month or so maybe. It was a very small piece of work, plus the tooling, but it proved certain things worked. Then, we got the first bits and pieces up and running in the cloud one piece at a time.

That incremental approach works very well for basically taking risk out of the thing. The same thing when we switched from Oracle to Cassandra as well, where we went from a mixture of Oracle and SimpleDB to Cassandra. Again, you stand up one server, you take one backend dataset, you tinker with it, keep tinkering with it until you figure out the recipe that works, and then you duplicate it. And then if that works, it keeps duplicating itself until you’ve got thousands of nodes of Cassandra in 50 or 100 different clusters.
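As an editorial illustration of the search auto-complete pathfinder project described above, the following is a minimal sketch of a prefix lookup that could sit behind such a service. It is not Netflix’s implementation: the `PrefixIndex` class, its `suggest` method, and the sample titles are all hypothetical, and a sorted list with binary search is simply one easy way to do prefix matching.

```python
from bisect import bisect_left


class PrefixIndex:
    """Hypothetical sorted-list prefix index (an assumption, not Netflix's design)."""

    def __init__(self, titles):
        # Keep (lowercased key, original title) pairs sorted by key so that
        # all titles sharing a prefix sit next to each other.
        self._entries = sorted((t.lower(), t) for t in titles)
        self._keys = [key for key, _ in self._entries]

    def suggest(self, prefix, limit=5):
        """Return up to `limit` titles whose lowercased form starts with `prefix`."""
        p = prefix.lower()
        # Binary search finds the first key that is >= the prefix.
        start = bisect_left(self._keys, p)
        matches = []
        for key, title in self._entries[start:]:
            if not key.startswith(p) or len(matches) >= limit:
                break
            matches.append(title)
        return matches


# Sample titles are illustrative only.
index = PrefixIndex(["Ratatouille", "Rat Race", "House of Cards", "Rango"])
suggestions = index.suggest("rat")
print(suggestions)
```

In the setup Cockcroft describes, something like `suggest` would run as a standalone service behind a load balancer while the rest of the website stayed in the datacenter.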

So it wasn’t just a mass migration all at one time?

No. The kind of thing where you go in and architect the version 2.0 thing for two or three years, and then deliver it with a pink bow on the top and say, “Everyone will move now.” That never works. I saw that fail at eBay, as well. It would just take a long, long, long time to get the next version.

You have to incrementally build things, so it’s very organic, and it’s an emergent architecture. It’s not designed centrally. It’s whatever anyone needed to do at the time. And a lot of talking about things so that bad ideas get rooted out and become understood as “avoid this.”

If you look back at your time at Netflix, is there something you’re particularly proud of that you left there?

The thing that I did specifically, when I went into the architect role for the cloud team, I was talking to Yury Izrailevsky, who is the VP for cloud there, and said, “I want to go out and talk about what Netflix is doing.”

I worked with our public relations team and explicitly went to create a Netflix technology brand. Because Netflix is a consumer brand. When you hear Netflix, you should think about movies.

This is something you have to deeply think about. Because at eBay, they really wanted eBay to be a retail brand. They didn’t want the technology brand to play to eBay, so we didn’t do very many talks at eBay — because they wanted to keep the brand pure. With Netflix, they sort of went, “Well, there’s an opportunity here.”

And one of the problems Netflix has is, “How do you attract the very best engineers in the industry?” If you’re going for talent density, you’ve got to create some attractants. To create the attractants, you’ve got to talk about what we’re doing. We’d already put the culture deck out, which was attracting people in general, but we wanted to talk about the technology.

Then I said, “We can do that, and I can go out and talk about Netflix. And then as we’re developing all this code, we should also back that up with a resource.” We had all the pieces to release a platform. I was pushing the idea that we should release a platform and got other people to buy into it, and then collectively we agreed, “Yeah, we should just go ahead and release this thing as a platform.”

I effectively product-managed that. I wrote, basically, none of the code, but I named a few of the projects somewhere along the way.

You asked me what I’m most proud of. I think it’s, basically, that you can’t mention “cloud” or “microservices” or whatever without mentioning Netflix at some point now. We really did put Netflix on the map as a technology place. The effect of that is they’ve been able to hire some amazing engineers over the years, people that would not otherwise have decided to go to Netflix.

Microservices mean faster businesses

You mentioned microservices. One of the things I saw covering the tech side of Netflix was this evolution from talking just about cloud to also talking about microservices. Can you explain that transition?

These newer architectures are built to be dynamic, and designed to be broken into small chunks so you can update them independently. As you try and go fast, and you try and do continuous delivery, it’s just harder to do continuous delivery if you don’t have API-driven infrastructure and self-service.

Cloud, for me, means self-service infrastructure, API-driven self-service. I don’t really care whether it’s public or private. We used to call the things we were building on the cloud “cloud-native” or “fine-grained SOA,” and then the ThoughtWorks people came up with the word “microservices.” It’s just another name for what we were doing anyways, so we just started calling it microservices, as well. … You’re trying to find words for things that resonate.

Right now, “cloud” sort of implies you’ve got to migrate to cloud, and people that aren’t migrating to cloud are going, “OK, yeah, but maybe I don’t want to do that.” It’s sort of an operations-y thing, whether you’re running on cloud or not. Whereas “microservices” is really a developer term. It’s developer-driven architecture and you can deploy it on anything you want.

The second reason microservices has gotten really big right now is that it’s the terminology Docker’s been using since the beginning, and Docker has both benefited from this microservice architecture being a thing and also helped enhance it. There’s an obvious way to deliver microservices using containers, but also, containers arrived because we were trying to do microservices. We start with one chicken and one egg, and you end up with a whole chicken farm, or something like that.

There’s a cooperative runaway effect where part of the reason microservices are hot right now is because of containers and Docker, and they were also part of the reason that Docker became of interest.

When you look at this evolution to embracing microservices, how important of a shift in architectures do you think this is? Eric Brewer told me recently it will be bigger than the movement to cloud in the first place. Do you get that same sense?

My pitch, if you’ve seen my slide decks, is that you start off with the business need, which is moving faster, and you need to move faster than your competition. If everyone’s doing waterfall, then you can keep bumbling along. As soon as somebody does agile, you go, “Oh crap, we need to do agile too.” They move up, and now everyone’s releasing code every two weeks.

Then, someone figures out how to do continuous delivery, and it’s putting code out multiple times a day. Then, there’s another one of those “Oh, crap” moments. “We’re getting left behind.”

This is what’s really driving it at the business level. That’s what the CIO and the product people care about. They’re all trying to get product out faster than their competitors. But when you look at what it takes to do continuous delivery, you just end up with something that looks like microservices because you have to be able to break things into small chunks and get them out very fast.

There’s a limit on how big a team you can have to be agile. The limit is probably defined by whatever Etsy is currently doing, because they’re the best people at running a very, very agile monolithic app. It’s an amazing feat, but most people are saying, “It’s amazing, but it’s easier to just do things in smaller chunks. It’s less efficient, but I care more about speed than efficiency.”

There’s definitely a sense that a monolith is a more efficient way — a well-written monolith, anyway, is a very efficient way to do things. But it’s very hard to find well-written monoliths. Most of them are tangled balls of mud with all kinds of disgusting things going on inside that are broken in very odd ways that are hard to debug.

That’s part of the problem: as you get a large team of developers on a monolithic app, it gets harder and harder to build. You want to make it more tractable and, effectively, less complex and surface some of the complexity. I think the overall complexity goes down.

It sounds like almost a business decision to some degree more than a decision that a bunch of architects got together and said, “You know, we’re going to rebuild these in different ways.”

One of the reasons I get really good traction in the talk I’ve been giving, from everyone from management down to the developers, is because I start with connecting it to the real goal here: which is to just go faster, and what are all the different ways you can get friction out of the developer experience and the product release experience. As you speed up your delivery process, you’re reducing the risk by doing smaller and smaller chunks that have lower and lower risk, and you end up going faster and faster. Basically, you want to change one thing at a time.

This ties back into why Docker is interesting. It only takes seconds to deploy something with a containerized production thing like Docker. If it takes seconds to deliver, why are you only doing it once every two weeks? You could do it a thousand times a day and it would still not be overhead. Whereas, in the old days, it used to take weeks or months to deliver something because it took forever to test it, and it took forever to procure the hardware. All that stuff’s instant now.
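The arithmetic behind the thousand-deploys-a-day remark can be checked with a short sketch. It assumes each deploy takes exactly one second and that deploys run one after another; the `deploy_overhead` function name is hypothetical and exists only for this illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def deploy_overhead(deploy_seconds, deploys_per_day):
    """Return (total seconds spent deploying, fraction of the day that represents)."""
    total = deploy_seconds * deploys_per_day
    return total, total / SECONDS_PER_DAY


# Assumed inputs: a one-second deploy, run a thousand times a day.
total, fraction = deploy_overhead(1, 1000)
print(f"{total} s per day, about {total / 60:.1f} minutes, {fraction:.2%} of the day")
```

Under those assumptions, a thousand deploys add up to roughly 17 minutes, a little over 1 percent of a day.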

That makes sense. One of the things I keep hearing as I’m talking to Mesos users is that development is so much faster now, they’re deploying stuff so much faster as a result of Docker on Mesos, let’s say.

Yeah, there’s lots of experiences like that. The speed is the thing that people want. No one wants their products to take longer to develop.

I’ve heard these kinds of things as well. One anecdote was an ops team that was doing its monthly review of ticket performance and whether tickets were being closed on time. They looked at their numbers, and the number of provisioning tickets they had had basically trended from whatever it was to a tiny fraction of it. It basically disappeared, and they went, “Oh, it must be a bug. Something’s gone wrong with the way we collect this data.”

So they went back and they said, “Yeah, but I can’t remember the last time I deployed a machine.” This is clicking-buttons-on-VMware-off-a-trouble-ticket kind of deploys.

So they went back and talked to the developers and they said, “Oh yeah, the last bunch of machines we got you to provision: we stuck Docker on them, and now we’re just doing our own deploys as often as we like.” And, “Yeah, thanks. Every now and again we’ll need another one so don’t forget how to provision machines, but we’re not asking for 10 new machines a day. We’re doing hundreds of deploys a day without involving you.”

They were kind of going, “Well, I guess that’s good, but what am I here for now?” There’s that sense of automating things out of existence as a good thing, but you have to deal with the way that the workflows move around.

The future: open source and SaaS

You’ve been away from Netflix for a while, and you’re at Battery now. What are you seeing as you’re engaging with Battery’s portfolio companies?

One of the things I do is act as a consultant to the CTO of companies, and a lot of these companies are SaaS across a bunch of different industries. They go through this thing where you start off building a Ruby on Rails backend or something, and eventually it gets a bit bigger, and they need to scale it. So some of them I’m helping scale off of a single backend database.

Or they’ve realized that their AWS bill is finally getting to be big enough that they should pay attention to tuning down a bit. My gut feel there is when your Amazon bill is about the same as what it costs to hire an engineer you go, “You know what? Before that doubles I want to just shrink it down and hire another engineer instead of spending the money on infrastructure.”

Sometimes I’m helping them with speeding up development, sometimes with migrations to cloud or between cloud vendors, and sometimes scalability and cost reduction. I’ve got some slides I’ve been using on how to optimize your build-out. Those are, generally, me acting as a technical consultant across the field.

“There’s lots and lots of pieces of our daily lives, and of industries’ daily lives, that are done with horrible, boring software that’s going to be replaced by a SaaS provider at some point.”

Are you seeing now in companies that the technology is fundamentally different or better? What’s the shift in what’s happening in startups today compared to 10 years ago?

One of the things is that everything is open source. Your hardware is all cloud. Your cost of building something is really tiny now.

There was a conference for children called HackKidCon last year. I did a talk called, “How to be a data scientist for a dollar.” My point was to teach these kids that, for less than a dollar spent, they could actually go and create an account on a cloud vendor, and provision Hadoop clusters, and learn to write MapReduce, or learn to use Spark, or learn to do things which a few years before would’ve been a hundred-person team and millions of dollars to just think about. All this stuff is available instantly, and I talked them through some of that.

The fun thing was that I actually had Google Glass and I put it on the head of this 12-year-old boy, and the video is basically him watching me give the talk, recorded on Google Glass. It’s a kid’s eye view of the event.