Microservices Summit

✏Austin W. Gunter:

this is the 2nd Microservices Practitioners Summit - there wasn't much out there from people who had done this at scale

so we wanted to bring in people with that practical experience to talk to you today

20% of people have no microservices in production - the rest are already running microservices

about 60% of people are interested in resiliency,

We start the day with technology with @mattklein123 and @varungyan

the hashtag for today is #msvsummit - look for that

Matt Klein:

I'm Matt Klein, a software engineer at Lyft - I'm going to say how we got to a microservice mesh architecture

3-5 years ago Lyft had no SoA - a PHP/Apache monolith with MongoDB as the backing store and 1 load balancer

PHP/Apache's process-per-connection model doesn't play well with the load balancer, so we had problems there too

2 years ago we had an external LB, a PHP monolith with haproxy to call internal ELBs, plus a bunch of Python services

we had problems with logging and tracing and understanding which layer something died in

in SoA now the industry has tons of languages and frameworks - 3-5 different languages in one deployment

also, there are per-language libraries for making service calls - PHP uses curl, Java uses Finagle etc

we have multiple protocols - HTTP/1.1, HTTP/2, gRPC, databases etc

we even have multiple infrastructures - IaaS, CaaS, on premise and more

we even have multiple heterogeneous load balancers, and very different observability of stats, tracing and logging

we also end up with multiple implementations of retry, circuit breaking and rate limiting, often partial ones

if you're using all of this stuff in your 5 langs it can be impossible to know what is calling what

and Authn and Authz is often an afterthought, with no key rotation

People do not understand how all these components come together to build a reliable system

People are feeling a lot of hurt, especially around debugging

when I joined Lyft people were actually afraid of making service calls as they couldn't know what went wrong

you have limited visibility into different vendors' logging and tracing models, so there is little trust

existing libraries often have partial implementations of best practices -

when we were building Envoy, people would ask why they needed this for retry

retry done wrong is the best way to bring down a system

if you do have a good answer, it is because you are using a library that locks you into a technology stack

if you're invested in JVM and you want to use Go services, you need to port the big library over

if you have a big standard library, upgrading it to a new version can be a huge pain point

Robust observability and easy debugging are the most important thing - without that devs don't trust the system

we have not given people good tools in SoA to do this kind of debugging - so productivity goes down

when people don't trust the network of service calls, they rebuild monoliths with fixed libraries again

Envoy wants the network to be transparent to applications, which is really hard

Envoy is not a library, it is more like nginx or haproxy - it is its own process that is next to each application

the application talks to Envoy locally, Envoy does the work and passes the result back to the application
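
a rough Python sketch of what that looks like from the application's side - the local port and Host-header routing here are assumptions for illustration, not Lyft's actual configuration:

```python
# The app only knows about its local sidecar proxy; the proxy handles
# discovery, load balancing, retries and stats.
# Port 9001 and Host-header routing are hypothetical choices for this sketch.
import requests

LOCAL_PROXY = "http://127.0.0.1:9001"

def call_service(service_name, path):
    return requests.get(
        f"{LOCAL_PROXY}{path}",
        headers={"Host": service_name},  # the proxy routes on the Host header
        timeout=1.0,                     # only the local hop; upstream timeouts live in the proxy
    )

# application code never needs the real address of the "users" service
resp = call_service("users", "/v1/users/42")
```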

Envoy is in C++ and is a byte-oriented proxy - it can be used for things other than HTTP - stunnel, redis, mongo

as well as that L3/L4 filter stack we have an L7 HTTP filter architecture that lets you do header work too

Envoy was built to be HTTP/2 first, but with an HTTP/1.1 bridge - we can proxy gRPC, which is HTTP/2 based

we have service discovery and active/passive health checking

and advanced load balancing with timeouts, circuit breaking, rate limiting and so on

we have best-in-class observability of tracing and stats

we have enough features to replace nginx as an edge proxy as well as in service to service mode

the model we have is many service clusters with an envoy instance with each service, talking to each other

and also using Envoy to call out to External services and discovery.

your service is only aware of your local Envoy, so it doesn't change whether it is in local, dev or production

Envoy sets up the environment so that dev, staging or production just works - you can mix local and cloud abstractly

we have 2 kinds of edge proxies - one terminating TLS and connecting our internal services,

but we also use Envoy to proxy between different geographic datacenters

we have an edge proxy Envoy, which calls the Envoy on our legacy monolith and python and go services

services don't talk to anything without talking to Envoy - it proxies DynamoDB and MongoDB too

most service discovery is built on fully consistent systems like ZooKeeper and Consul

but service discovery isn't fully consistent - it changes over time

if you have a fully consistent problem that can be eventually consistent, make it eventually consistent

because service discovery is an eventually consistent problem, we designed it that way

we have about 300 lines of Python that checks each host in once a minute to a DynamoDB table
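
as a hedged sketch (not Lyft's code), that registration loop might look something like this - the table and attribute names are made up:

```python
# Each host writes itself into a DynamoDB table once a minute.
# Readers tolerate stale data, which is what makes this eventually consistent.
import socket
import time

import boto3

table = boto3.resource("dynamodb").Table("service_hosts")  # hypothetical table name

def register_forever(service_name, port, interval=60):
    host = socket.gethostname()
    while True:
        table.put_item(Item={
            "service": service_name,
            "host_port": f"{host}:{port}",
            "last_checkin": int(time.time()),
        })
        time.sleep(interval)
```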

we have active health checks every 15 seconds, and passive restart on fail

we trust the active health check more than the discovery service, as the discovery data is lossy

if the health check fails we don't route; if the health check fails and discovery shows absent we remove the node
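
that routing rule as a tiny illustrative sketch:

```python
# Trust the active health check over the (lossy) discovery data.
def route_decision(health_check_passing, present_in_discovery):
    """Return (route_to_host, evict_from_pool)."""
    if health_check_passing:
        return True, False       # healthy: keep routing
    if not present_in_discovery:
        return False, True       # failing AND gone from discovery: remove the node
    return False, False          # failing but still registered: skip it, don't evict
```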

we have not touched our discovery service in 6 months because it does converge

people who do use fully consistent systems like zookeeper and etcd end up building eventually consistent discovery on top

we have multiple service discovery models, including zone aware load balancing - local first then remote

we can generate dynamic stats, and also do circuit breaking and rate limiting too

we plan to open source the rate limiting service next week

we support shadowing so you can fork traffic to a test server

we have built-in retries, and inner (1 service) and outer (whole call chain) timeouts
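
a hedged sketch of the inner/outer timeout idea - not Envoy's implementation, just the shape of it in Python:

```python
import time

import requests

def call_with_retries(url, per_try_timeout=0.25, overall_timeout=1.0, max_retries=3):
    deadline = time.monotonic() + overall_timeout          # "outer" timeout for the whole attempt
    last_error = None
    for _ in range(max_retries):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            # "inner" timeout: a single try never waits longer than the smaller budget
            return requests.get(url, timeout=min(per_try_timeout, remaining))
        except requests.RequestException as exc:
            last_error = exc
    raise TimeoutError(f"gave up on {url}") from last_error
```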

you can have all these features, but without observability no-one will use them

by routing all traffic through Envoy, we can produce stats, but also sample entire request chains

because we have a stable requestID, we can trace and log across multiple systems and servers
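
a minimal sketch of that propagation pattern - the header name is an assumption for illustration:

```python
import uuid

import requests

REQUEST_ID_HEADER = "x-request-id"   # hypothetical header name

def incoming_request_id(headers):
    # reuse the edge-assigned ID if present, otherwise mint one
    return headers.get(REQUEST_ID_HEADER, str(uuid.uuid4()))

def call_downstream(url, request_id):
    # the same ID travels with every hop, so logs and traces can be stitched together
    return requests.get(url, headers={REQUEST_ID_HEADER: request_id}, timeout=1.0)
```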

you can have a dashboard that shows all connections between any 2 services

this lets you look at any 2 hops in the system and how they relate

for all of the data transited through envoy, you can see the flow of requests through services by default

our logging system, Kibana, uses the stable request ID to connect all the different components and show what happened

a lot of people say 'performance only matters if you are google' - dev time is more important

but latency matters, and people don't think about tail latency - the p99+ problem

we have a lot of tools that make programmers more productive, but make it much harder to see where time is being spent

throughput may not be the most important thing, but being able to reason about where time is spent really matters

if the service proxy itself has tail latencies that are hard to reason about, you lose the debugging benefits

you don't want a proxy that adds latency variance and makes your debugging harder

Lyft has >100 services, >10,000 hosts and >2M RPS - we proxy gRPC, mongodb and dynamodb too

we sniff the mongodb and dynamodb traffic to generate stats on performance and latency

we are adding redis soon to reduce outliers

we are spending more time on outlier detection and rejection

we are working to standardize load balancing and rate limiting across services

Envoy has only been open source for about 4 months, but we have a lot of interest already

we want to build a larger community around Envoy

you can get the code at lyft.github.io/envoy

Flynn:

when you were getting lyft to switch over to Envoy, what was hard?

Matt Klein:

we started incrementally - Envoy as front proxy first, then we added Envoy to the monolith, then on MongoDB

we are now fully deployed but it took a year to get concurrent development

q:

do you reserve a core for envoy?

Matt Klein:

you can do that, but it can make things worse. Envoy is non-blocking and parallel, so run 1 thread per core

q:

are data pipelines e.g. Spark clusters integrated with Envoy?

Matt Klein:

we do use it for LB but we don't use it directly for Spark

q:

can you add filters?

Matt Klein:

we don't have any public docs on filters yet, but multiple companies have written them from the code

q:

a disadvantage is bringing the work into the envoy team - how do you get it out again?

Matt Klein:

that hasn't been a problem so far - "if the word 'car' appears in Envoy we have done it wrong"

the filtering model is extensible enough that we haven't needed to block on Envoy

q:

a lot of systems burn network bandwidth on health checks - do you watch responses and health checks separately?

Matt Klein:

active and passive health checks are configurable so you can decide which to use.

there is a perception that active health checking is wasteful, but with plaintext kept-alive HTTP/1.1 it is very low overhead

we run health checks every 15-30s and it is noise in our traffic graph

if it does have scale issues we are working on subsetting these so the traffic doesn't transit so much

there is no reason that the service discovery system couldn't do health checks too

q:

I like to deploy microservices in docker containers, would that work for Envoy?

Matt Klein:

we support hot restart in Envoy so it can deploy new code without losing connections - that works fine in containers

✏Austin W. Gunter:

we are livestreaming at microservices.com/livestream - follow along there

Varun Talwar:

I'm Varun from Google - I'm here to talk about microservices at Google, but based on our gRPC experience

Stubby is an internal framework at Google for all service to service calls

we want to bring what we learned from Stubby into the newer open source gRPC

people want Agility and Resilience, and that is why we use microservices,

but we also care about developer productivity - as @mattklein123 said, observability is key to trust

even a 400ms delay can have a measurable impact on search quality and usability

Stubby was an RPC - Remote Procedure Call - framework written when google started, used for all google services

Stubby is a large framework, and parts of it are being made more open

Google's scale is about 10^10 RPCs per second in our microservices

every Googler defines datatypes and service contracts, and gets magic around load balancing, monitoring and scaling

making google magic available externally - Borg became Kubernetes; Stubby became gRPC

HTTP/1.x with JSON doesn't cut it at Google scale - stateless, text, loose contracts, TCP per request, noun-based

a framework with tighter contracts, more efficient on the wire and with language bindings helps a lot

when APIs are evolving at different rates, in classic REST this needs a lot of work in polyglot environments

from a pure compute perspective, having text on the wire isn't the most efficient

we needed to establish a lingua franca for strongly typed data - Protocol buffers released in 2003

declare your data in a common description format, and generate code for any language with fwd/backward compatibility

at Google you are either doing Proto to Proto, or Proto to UI - that's it

protobufs incrementally number fields in order of creation, so you can evolve data structures over time

the other big protobuf advantage is carrying binary on the wire, giving a 2-3x improvement over JSON

designing for fault tolerance and control is key - sync vs async; deadlines and cancellations; flow control; metadata

different languages default to doing these in different ways, so moving to the core helps

in gRPC we have deadlines, not timeouts - each service adds on the time taken and aborts if it exceeds the deadline
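
a hedged sketch of deadline propagation with gRPC's Python API - the service, method and stub names are hypothetical:

```python
import grpc

class ProfileService:  # in real code this would subclass the generated servicer
    def __init__(self, downstream_stub):
        self.downstream_stub = downstream_stub

    def GetProfile(self, request, context):
        remaining = context.time_remaining()   # seconds left on the caller's deadline, or None
        if remaining is not None and remaining < 0.05:
            context.abort(grpc.StatusCode.DEADLINE_EXCEEDED, "not enough budget left")
        # hand most of the remaining budget to the downstream call instead of a fixed timeout
        downstream_timeout = None if remaining is None else remaining * 0.8
        return self.downstream_stub.GetAccount(request, timeout=downstream_timeout)
```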

deadlines are expected, but we also need unpredictable cancellations too, when result not needed

with services you need cascaded cancellations that clear all dependent calls too

flow control is not a common occurrence, but matching a fast sender to a slow receiver or vice versa does matter

with gRPC when there are too many requests or responses there is a signal sent to slow down

you can set up service configuration policies that specify deadlines, LB policy and payload size

the SREs like having this config separately so they can control when things are surging

you don't just need an RPC contract, but also to send metadata for AuthN and trace contexts etc

this metadata helps keep the control flow out of the specific APIs

you want observability and stats - you need a common nomenclature to make sense of large call graphs

you can go to any service endpoint and see a browser dashboard in real time of how much traffic is flowing

Tracing requests through services is key - you can add arbitrary metadata such as client type and track it all

you often have 1 query out of 10,000 that is slow - you want to trace it through the whole call chain

you also want to look at aggregate info to see where the hotspots are

Load balancing matters, and you need to communicate between the front end, back end and the load balancers

gRPC-lb is moving to a simpler model where the client round-robins over a list of endpoints that the load balancer provides

gRPC is at 1.0 now, and has more platforms and languages now that it is open source

gRPC has service definitions and client side libraries that bridge multiple languages and platforms

HTTP/2 is the basis, which gives streaming and much better startup performance

coming soon: reflection; health checking; automated mock testing

it's at grpc.io and github.com/grpc

Flynn:

how did you switch from Stubby?

Varun Talwar:

that is still happening at Google, as a lot of gRPC benefits are already in Stubby

we have to show RoI to service owners inside Google. If you don't have Stubby, the value is clearer

q:

how do you do the proto sharing and distribution? one concept of 'user' or every service has one?

do you share contracts between services?

Varun Talwar:

every service defines its own, apart from things like tracing and logging

q:

do you generate client and service code from protobufs? does it limit flexibility?

Varun Talwar:

yes we generate both; all we generate are stubs in that language - whatever you define is what you get

we try to make our APIs as close as possible to the language you are using, so we have futures in node etc

Christian Posta:

I'm Christian Posta from Red Hat, talking about managing data inside microservices - slides at bit.ly/ceposta-hardest-part

I commit and contribute to apache projects like Camel, ActiveMQ and Kafka

I used to work at a large webscale microservices unicorn company, now I bring that to enterprise

when developers approach microservices, or what was called SoA, they need to think about more than infrastructure

Adrian Cockcroft warns that you need to copy the Netflix process, not just the results of it

when Enterprise IT approaches this microservices world, there is a mismatch of culture

microservices is about focusing on speed - in terms of being able to make changes to the system in production

IT in Enterprise has always been seen as a cost center, or as a way of automating paper processes

how do you change a system not designed with this kind of iteration in mind to go fast?

we need to think about managing dependencies between teams as well as services

the difficulty of data is that it is already a model of the world - it hasn't got the human context

even something as simple as a book ends up in multiple models - editions, physical copies, listings by author

each service looks at these things a little bit differently - Domain Driven design helps with this

domain driven design means breaking things into smaller understandable models & defining boundaries around them

enterprise models can end up more complex than purely virtual companies, as the models map to processes

if you write a lot of denormalised data into your databases, you need to plan for the queries you're running

Keep the ACID and relational conveniences as long as you can, but be aware of what they cost

with microservices we're saying "OK, database workhorse, we've got it from here"

saying "a microservice has its own database" sounds very worrying to an enterprise data modeller

microservices means taking concepts of time, delay and failure seriously rather than ignoring them

when we need to distribute writes to maintain consistency, we end up building a transaction manager

it is easy to accidentally build an n+1 problem into your services, and you end up adding extra calls to fix

with CAP, you can't trade off P, so you have to pick C or A - but there are lots of consistency models

you don't need strict linear consistency if there is no causal relationship between entities

in real life there is very little strict consistency—think how paper processes propagate updates over time

sequential consistency is often a good answer - a log or queue to process data in order

when we do this, we have made the replication and indexing that databases did for us an explicit process

Yelp has MySQL Streamer, LinkedIn has Databus, Zendesk has Maxwell - all do this queue-to-DB model

There's a project debezium.io that captures database changes and streams them in a queue to something like Kafka

I'm going to do a live demo of debezium.io

q:

what were you going to demo?

Christian Posta:

I was going to start up kafka and a mysql database, and connect them with debezium

debezium captures the primary key of the table and uses this to send the before and after changes to the DB

so we can show a change to the database from mysql binlogs in a JSON form in a kafka queue
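
a rough sketch of consuming those change events with kafka-python - the topic name and envelope layout are assumptions based on the description above:

```python
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "dbserver.inventory.customers",            # hypothetical Debezium topic (server.db.table)
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    change = event.get("payload", event)       # some configs wrap the change in a "payload" envelope
    before, after = change.get("before"), change.get("after")
    if before is None:
        print("insert:", after)
    elif after is None:
        print("delete:", before)
    else:
        print("update:", before, "->", after)
```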

q:

how do we make this work in an environment without arbitrary extensions, like Postgres?

Christian Posta:

we are working on adding Postgres log support - there was a PR for that recently

q:

when you mentioned Domain Driven Design, is CQRS in play?

Christian Posta:

CQRS is separating different read/write workloads into different data systems

if your reads are simpler you could use this to transform data into a denormalised system

✏Austin W. Gunter:

we've had around 1000 people tuned in to the stream at https://www.microservices.com/livestream/ and we're starting again

Josh Holtzman:

Microservices are the Future, and always will be

xoom.com is a digital remittance company founded 2001, acquired bluekite in 2014, joined paypal 2016

remittance is sending money between countries - we go between 56 countries at the moment

as a finance company, we have very strict regulatory compliance, both in the US and the 55 other countries

we have 16 years of code and 16 years of data to migrate into our microservices

we have lots of code and lots of tables, and code that assumes all those tables are joinable

xoom was an all java shop, but bluekite made us polyglot - we have many languages and persistence techs now

Paypal acquiring us imposed new rules on us, but they also are used to a polyglot environment

we wanted to break up the monolith when we hit build time limits on our SQL based infrastructure

so a few years ago we started to decouple the teams to reduce the build times

we wanted to understand which parts of our stack were the bottlenecks, and scale them appropriately

moving to microservices we had to change a lot of programming paradigms and idioms

we needed service discovery and monitoring to understand performance

we had snowflake code all over the place in our load balancers and deploy path

we needed to switch to a unified build and deployment pipeline

and we needed to pick apart our databases and define the data ownership and contracts

microservices can be a distraction to our engineers - things like circuit breakers and throttles are hard

API designs need thought - the N+1 problem needs thinking about, and RPC vs REST adds complexity too

with API designs, response code granularity can be a huge issue too - do you pick just some http response codes?

Contracts are a key point if you are polyglot coders - you need strong contracts for packaging and metadata

we use docker containers for microservices; each service has metadata on the containers and runtime introspection

we include things like pager rota in the metadata so you can introspect them to know who to call

we can monitor and manage the instances uniformly with this model

Having polyglot code can slow you down; if you can stay in a single language and db, do it

our service discovery is similar to what Envoy is doing,

we have a custom layer 7 load balancer - xoom.api resolves to the local network

so I can hit auth.2.xoom.api and get all the instances with reputation-based routing

we have a zookeeper backend, but we have an eventually consistent layer on top of it

we have a service portal that shows the services, health checks and routing for each one

if you integrate k8s, you need to think about external vs internal service discovery and ip routing

for monitoring we initially set up a graphite system and threw lots of data at it, and crashed it with load

we were trying to instrument time taken on every call, and this was enough to overwhelm graphite

we chose to use http and json for our internal calls between apis

a call that involves a post and a write, as opposed to a read, is much more complex to monitor

we built a time series for every endpoint and call, and that also created a lot of traffic

we used the dropwizard monitoring library, which gave us gauges and histograms as well as counters
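
dropwizard is a Java library, but as an illustrative Python sketch of why per-endpoint histograms matter (averages hide the tail):

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import quantiles

latencies = defaultdict(list)          # endpoint -> observed durations in seconds

@contextmanager
def timed(endpoint):
    start = time.perf_counter()
    try:
        yield
    finally:
        latencies[endpoint].append(time.perf_counter() - start)

def report(endpoint):
    samples = latencies[endpoint]      # needs at least a couple of samples
    cuts = quantiles(samples, n=100)   # 99 percentile cut points
    return {"count": len(samples), "p50": cuts[49], "p99": cuts[98]}

# with timed("POST /api/payments"): ... call the endpoint ...
```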

we were very worried about performance when we started on this journey - we were worried about extra net traffic

we spent a lot of time instrumenting our code before we made any changes, and I recommend that

we improved the throughput of our service dramatically, primarily because of the shift to accurate monitoring

this helped us reduce contention over shared resources, despite making more RPCs overall

the latency distribution is wider now - we adjusted latency sensitive APIs to be deployed nearby

infrastructure as code matters - TDD isn't just for code, write tests for deployment too

don't treat deployment code and networking configurations as special - they all need tests too

by standardising app packaging, we can have contracts for deployment too

we use git-flow for new features, with a container per branch using docker-flow, and automated + self-service deployments

by standardising the deployment pipeline, we can have a portal to enable PMs to deploy versions without ops

data ownership is hardest - we are eliminating cross-domain joins and adding apis to wrap them

we have about 100 different services now

the key is to measure everything, and be prepared to scale monitoring to cope

application packaging contracts and delivery pipelines are mandatory

staff a tooling team to build test and deployment automation, and bring in network ops

although our monolith is still partly there, the infrastructure and culture has improved everything

Flynn:

what was the challenge you were least expecting?

Josh Holtzman:

the metric explosion caught us off guard - be prepared for that

q:

how do you do integration testing when you have a large number of services?

we have ~200 clusters we can spin up and down in both Amazon and our data center

the other integration testing approach is to create mocks so you can run those; you can also run containers locally and route to the cloud

Josh Holtzman:

for us, anyone who writes code can't touch the production network and vice versa - this is not very DevOps

q:

sounds like you built together multiple solutions. how do you monitor end to end?

Josh Holtzman:

for our Java applications we wrote a wrapper application to give tracing for free

calico gives us the ability to have routable ip addresses per pod, which helps us with monitoring the whole system

we found that we had 2 services that constantly want to make joins, so they need to be one service not two

q:

you mentioned that this was a cultural change, what was the impact?

Josh Holtzman:

our product managers have been very customer focused. The big change was getting them to think about SLA and contracts as well

q:

were there challenges in batch jobs?

Josh Holtzman:

putting the batch jobs with the domain that owns them makes more sense

Rafi Schloming:

I'm Rafi Schloming from Datawire - we founded it in 2014 to focus on microservices from a distributed systems background

I participated in every version of AMQP and had built lots of distributed systems with them, so I thought it would be easy

I wanted to look back at my learning about microservices

wikipedia isn't helpful here - "there is no industry consensus", "processes that communicate", "enforce a modular structure naturally"

there are a lot of good essays about microservices, but also a lot of horror stories of going wrong

the 3 aspects I want to cover is the technology, the process and the people

we learned from experts, from bootstrapping ourselves and from people migrating to microservices from many origins

3 years ago it was very technically focused - a network of small services, hoping it would make better abstractions

we read every story of microservices, went to conferences, started the summit ourselves to share the ideas

the people picture: everyone has a developer happiness/tooling/platform team and service teams that build features

technically we saw a control plane for instrumenting the services, the services themselves, and a traffic layer

it's a lot of work to build a control plane, so we decided to provide that as a service for the teams

so we ingest interesting application events - start, stop, heartbeat - log these and register services; transform & present

we were building a classic data processing pipeline of ingest, write source of truth, transform and present

for version 1 we built discovery - highly available, low throughput and latency; low complexity and able to survive restart

we started with vert.x and hazelcast and websockets with smart clients

for version 2 we added tracing - high throughput, and a bit higher latency was OK

version 3 we added persistence for tracing by adding Elasticsearch

this was the 1st hint of pain - we had to reroute data pathways and had coupled changes, and this gave a big scary cutover

v4: we added persistence for discovery, using postgres, which was another scary cutover - let's fix our tools

Deployment was hard. we had tried docker, but that was hard to bootstrap; kubernetes required google not amazon

we redesigned our deployment system to define the system in git to bootstrap from scratch

this meant we could use minikube locally with postgres and redis in docker images

and then spin this up to production running in amazon with our own kubernetes cluster

we built tooling to make this work across the different dev and deployment environments

did we just reinvent DevOps the hard way? we were thinking about operational factors, we built a service not a server

rather than a Service Oriented Architecture, we had a Service Oriented Development

Architecture has lots of upfront thinking and a slow feedback cycle. Development is more incremental

Development is frequent small changes with quick feedback and measurable impact at each step

so microservices are a developmental methodology for systems, rather than an architectural one

small frequent changes and rapid feedback and visibility are given for a codebase, but harder for a whole system

so microservices are a way to gather rapid feedback - not just tests but live measurement

instead of build - test - deploy we want build - test - assess impact - deploy

so measure throughput, latency, and availability (measured as error rate)
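
a small sketch of that "assess impact" step, with availability expressed as 1 minus error rate (illustrative only):

```python
def assess(window_seconds, total_requests, errors, latencies_ms):
    throughput = total_requests / window_seconds if window_seconds else 0.0   # requests per second
    availability = 1.0 - (errors / total_requests) if total_requests else 1.0
    p99 = sorted(latencies_ms)[int(0.99 * (len(latencies_ms) - 1))] if latencies_ms else None
    return {"rps": throughput, "availability": availability, "p99_ms": p99}

# e.g. assess(60, 12_000, 36, samples) -> availability of 0.997
```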

the experts model of canary testing, circuit breakers and so on are ways of making sense of a running system

Technical: small services, scaffolding for changes. Process: service oriented development. People: tools and services

working with people migrating gave us much more information

migration is about people. Picking a technical stack for the entire Org is hard; refactoring has lots of org friction

creating an autonomous team to tackle a problem in the form of a service is much easier

some organisations hit a sticking point, others didn't slow down

the way to think about microservices is in dividing up the work: build features (dev) and keep it running (ops)

you can't easily divide along these lines - new features make it unstable. devops stops misaligned incentives

microservices divides up the work - a big app made of smaller ones, that are easier to keep running, aligning incentives

if you think about microservices as an architecture you forget about the operational side of keeping them running

the easy way: start with principles of People and Process, and use that to select the technology

Flynn:

how would you boil this down to one statement?

Rafi Schloming:

start with the people and think how to divide up the work first, let that lead to the technical perspective

q:

how much time did you spend on research and things that didn't make production?

Rafi Schloming:

it's hard to quantify that time spent - it ended up as a fragmented and incremental view

q:

do you see Conways law affecting your team size?

Rafi Schloming:

yes, there is an impact there - trying to fit the information into the picture

that the shape of the team drives the shape of the technology is true, but physics pushes the other way

Nic Benders:

I'm Nic Benders, chief architect at New Relic, talking about Engineering and Autonomy in the Age of Microservices

I want to talk about what you can accomplish in an engineering org with microservices

New Relic started out with a data collection service and a data display service that started out micro and grew

we now have over 300 services in our production environment

Conway's law is always in play - our production environment reflects the communications & dependencies between teams

Conway's law is about how teams communicate, not the actual org chart. It's the edges, not the nodes that matter

Conway: Organisations are constrained to produce designs that are copies of the communication structures of the org

microservices is meant to define teams around each service - that is the core

componentisation via teams organised around business capabilities - products not projects, so long-term ownership

smart teams and dumb communication pipes - use a lightweight tool like a wiki or blog

durable full-ownership teams organised on business capabilities, with authority to choose tasks & complete them independently

reduce central control - emphasising information flow from the center and decision making at the edge

Eliminate dependencies between teams as each dependency is an opportunity to fail

having a re-org seems like a good idea, but it doesn't really work well if you just rename teams and change reporting lines

what if we look at an org structure as an engineering goal? Optimize for agility - not utilisation of a team

if you optimize teams for efficient usage of the team, you make sure that they have a backlog to keep busy

what we need are short work queues and decision making at the edge

as chief architect, I know far less about the domain than the engineer working on the problem does

at new relic, we're data nerds. We should use data to make our changes, not VPs in offsites

the most important thing in our org change is to break our dependencies between teams

we drew the nodes as teams and the edges as dependencies, and simplified universal ones

we proposed some much simpler dependency diagrams, with fewer, stronger teams with full ownership

in a full stack team, you are missing a business ownership component, so we added PMs and tech leads for internal teams

for the team to work it needs more T-shaped people, with depth in one area and breadth across others

we abolished architecture reviews, and made each team stand alone and own its decisions

we decided to allow team self-selection, as people know their skills better than we do

we put out all the jobs needed by the department and the engineers pick the ones to do. This is harder than it looks

Managers really didn't like this. Managers tend to follow instructions anyway, so it worked.

Engineers didn't like it either. They didn't trust us - they thought there would be fewer jobs that they wanted

they also worried that they would pick the wrong thing, or that the teams wouldn't gel without managers

We almost backed down. But we had to get the teams to self correct. We had failed to empathize with their concerns

we had to communicate over and over that this wasn't a stealth layoff or a job fair, but we would take care of them

we were not shifting the burden of being responsible to the employees but making sure we still looked after them

we defined the teams & the skills they needed, not in terms of positions & got everyone in a room to find new teams

at this point we had at least made it clear that there were other teams that you could move to

about a third of the people there did switch teams - lots of new teams formed from scratch

working agreements per team were defined as "we work together best when…" for them to fill in

the insights team picked Continuous Deployment Weekly demos and Retros, and Mob Programming

Mob Programming is like pair programming, but with 6 people sitting round the computer with 1 typing - huge agility

this reorg really worked - we shipped far more this year than expected, because they worked faster on what mattered

Teams understood their technical remit, but not what the boundaries were - we were used to side projects

we wrote a rights and responsibilities document - teams write own Minimal Marketable Features, but must listen too

maybe you aren't going to try a 6-month re-org, but there are takeaways

you hired smart engineers - trust them. We didn't do this with MBAs and VPs but with the teams themselves

my presentation is up at http://nicbenders.com/presentations/microservices-2017/ and you can tweet me with comments

the main thing I was worried about tactically was that making 300 people fit into 300 jobs and teams would not work

there were a few people in critical roles that we had to keep in place, and we really owe them

q:

what does a manager do in this kind of team?

Nic Benders:

regardless of org structure, managers need to look after their team and the teams' careers

we have spent more time encouraging the embedded PM to work with the team since

q:

what happened giving the teams total control of technology?

Nic Benders:

I sometimes think "that's not the best tech really, it's just some hacker news thing" but if they can go faster…

we have some constraints - we are container based, but if you need, say, an Elixir agent you have to build that too

q:

you mentioned 6 months - was that how long it took to settle down?

Nic Benders:

within the 1st month there were teams up and running, but the experience varied.

q:

did the managers go and find new teams, or were they fixed?

Nic Benders:

in general managers were the core of the team, and engineers moved to them, which may be why they were unhappy

q:

how did this map to employee performance management?

Nic Benders:

we did reset the performance history, and had a lot of success, and a modest turnover too, close to the annual average

q:

who owns the space between the teams? how do they call each others code?

Nic Benders:

communication is owned by the architecture team, and we have cross-functional groups for each language etc

we have a product council to say what the key products and boundaries are, but not the detail, which each team decides

we mapped every product and service, including external ones, before we moved everyone

we had a 2 week transition to make the pager rotation handovers and deploy work.

q:

after this do the people feel they need another team change? how often do you redo this?

Nic Benders:

it was such a production that we would rather have continuous improvement than an annual scramble

we have a quarterly review per team, but we want to make it possible for internal transfers to be low friction

Susan Fowler:

I'm Susan Fowler here to talk about Microservice Standardization

I started off thinking I would do particle physics forever, but there are no jobs in physics

I worked on the Atlas experiment at CERN, and then went to Uber to work on their 1000 microservices

there were some microservices that had lots of attention, but we were the SRE microservices consulting team

I also wrote a book called Production-Ready Microservices - there is a free summary version online

every microservice organisation hits 6 challenges at scale

challenge 1: organisational siloing and sprawl - microservice developers become like the services themselves: very siloed

unless you standardise operational models and communication, they won't be able to move teams

when you have too many microservices, you can't distribute ops easily, so the devs are fighting ops battles too

challenge 2: More ways to fail - more complex systems have more ways to fail. don't make each service a SPOF

challenge 3: competition for resources - a microservice ecosystem competes for hardware and eng resources

challenge 4: misconceptions about microservices - wild west; free rein; any language; any db; silver bullet

myth: engineers can build a service that does one thing extraordinarily well, and do anything they want for it

a team will say "we heard cassandra is really great, we'll use that" - the answer: "you are taking on the ops too"

microservices are a step in evolution, not a way out of problems of scale

challenge 5: technical sprawl and technical debt - favorite tools, custom scripts; custom infra; 1000 ways to do

people will move between teams and leave old services running - no-one wants to clean up the old stuff

Challenge 6: inherent lack of trust - complex dependency chains that you can't know are reliable

you don't know that your dependencies will work, or that your clients won't overload you; no org trust

no way of knowing that the microservices can be trusted with production traffic

microservices aren't always as isolated from each other as their teams are isolated

The way around these challenges is standardization. Microservices aren't isolated systems

you need to standardize hardware (servers, dbs) Communication (network, dbs, rpc) app platform (dev tools, logs)

the microservices and service specific configs should be above these 3 standard levels

a microservice has upstream clients; a database and message broker; and downstream dependencies too

the solution is to hold microservices to high architectural, organisational and operational standards

determining standards on a microservice-by-microservice basis doesn't establish cross-team trust, and adds debt

global standardization company-wide - it must be global and general; that is hard to determine from scratch & apply to all

the way to approach standards is to think of a goal - the best one to start with is availability

availability is a bit high level, map that to stability, reliability, scalability, performance, monitoring, docs

microservices increase developer velocity, so more changes, more deployments, more instability - you need deployment stability

if you have development, canary, staging, production deployment pipeline, it should be stable at the end

Scalability and performance - services need to scale with traffic and not compromise availability

fault tolerance and catastrophe preparedness - they need to withstand internal and external failure modes

every failure mode you can think of - push it to that mode in production and see how it does fail in practice

Monitoring and Documentation standards mean knowing the state of the system, and very good logging to see bugs

Documentation removes technical debt - organisational-level understanding of it matters

implementing standardisation needs buy-in from all org levels; know the production-readiness requirements and make them part of the culture

my free guide is http://www.oreilly.com/programming/free/microservices-in-production.csp and my book is at http://shop.oreilly.com/product/0636920053675.do

when you have a set of answers to the standards for the whole company, it makes it all easier

Flynn:

how long did it take to fix all the services?

Susan Fowler:

It's still going on; we started with the most urgent services and helped them understand what was around them

q:

you mentioned proper documentation - how do you keep up with code changes?

Susan Fowler:

that's a good question - keep docs to what is actually useful and relevant. Describe arch and endpoints

documentation isn't a post mortem for an outage, but an understanding of what works

q:

is slicing up the layers a contradiction to devops?

Susan Fowler:

the microservices teams should be on call for their own services' outages - but they do need to do both
