05 Mar 2026

Surviving Database Migrations: Ambitious and a little crazy


At Plaid, we just finished a two-and-a-half-year project to migrate 234 databases across 100 services from AWS Aurora MySQL to self-hosted TiDB.

We’ve been asked many times by colleagues at peer organizations, and by our vendor, what enabled us to execute the migration with high velocity and high correctness without a commensurately high headcount or a mature team.

The following is a treatise to answer that question from our lived experience.

Where it began

Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning. - Churchill

At the end of 2021, I was a team manager for a platform team that depended heavily on a NoSQL data store that was unruly, unmaintained, and massive. We considered managed services, but decided it was cheaper to build out our own team, and we’ve been right about that!

Due to repeated outages, I secured headcount by the end of the year to build out an online storage team. In 2022, I was the director for the core services area of the platform, which covered my original team, a sibling team, and the newly formed but empty online storage team.

I poached our founding engineer from my division’s hiring pipeline because of his resume and his obsession with databases. I found out later he would never have accepted an offer for the core services area that I led, except for the pitch I made about founding our online storage team.

Hiring Mingjian Liu was a great start. Now we had one talented engineer to run the database servers for an $8 billion company and support 350 software engineers… This was an improvement!

Next, he and I took alternating weeks of responsibility for level one (L1) on-call for the most reliability-critical data stores at Plaid. On our off weeks, we were L2 on-call.

At this stage, we were tamping down the bigger fires in data store reliability, things like:

  • building observability (thanks PMM!)
  • standardizing alerts
  • advanced tunings
  • working with client teams on group commits
  • workload separation
  • dodging between outages to try to get back to our planned work

By “we”, I mean my founding engineer, because I was a director for the division in wall-to-wall meetings and moonlighting as a software engineer.

We earned a lot of goodwill in fighting the great Winter of NoSQL outages and putting the org back on track. Then we took over ownership of the MySQL 5.6-to-5.7 upgrades after learning that AWS Aurora blue/green deploys caused a 30-minute outage. By this time, we had two more engineers on the team, and we were building momentum.

At points when we caught our breath in the upgrades to 5.7, we debated our longer term strategy: do we follow the default path and upgrade to 8.0 or make a bet on a better future with an alternate technology?

Team Building

You go to war with the army you have, not the army you might want or wish to have at a later time. - Rumsfeld

Hiring is hard, and hiring fast is harder, but there’s a cheat code here: you do not need deep expertise in a given domain across the whole team if you have good leadership and technical depth in a few members.

I made the choice that we would prioritize ownership, operational excellence, and general platform experience rather than requiring deep database expertise in our team.

This required some trade-offs… It meant that a lot of the initial design work rested on the shoulders of our founding engineer and myself, and it required us to be better at scoping the effort and at mentoring and coaching others to grow along the way.

Looking back from 2026, we have a crew of people on the team who were incredible contributors, grew through the process, and made it possible. Their lack of database pedigree was not an impediment.

One of them stepped into the role of team tech lead. Another one stepped into the role of engineering manager for the storage team. Our attrition was very low during this project arc because of meaningful work, camaraderie and good morale.

Resource constraints drive invention

Difficulties mastered are opportunities won. - Churchill

Now we had a team and we’d faced our first few battles and survived.

With the prospect of the MySQL 8.0 upgrades, we veered hard off-road and made a calculated bet:

We can deliver this project to the organization, improving the life of every developer at Plaid and of our customers, with a 6-person team largely missing deep database expertise, in an amount of time that stood up to business scrutiny.

How’s that for ambitious and a little bit crazy?

If we had taken a standard approach, or if we hadn’t had team buy-in and leadership buy-in, we would have failed. I think of this category of project as one that will “default fail.”

Converting “default fail” into “will succeed”

Business Constraints

To varying degrees, we nerd out about perfection when operating at the data store layer:

  • It may be perfection of read-your-writes consistency.
  • It may be perfection of snapshot isolation and phantom-read prevention.
  • It may be perfection of steady p999 latency or achieving greater than five nines of availability.

But the business doesn’t give a flying fuck about perfection and that’s the correct fiscal invariant!

The business cares about money and what drives growth. Datastores, reliability, consistency and disaster recovery are implementation details in service of the customers and shareholders.

The business wants what is good enough for the circumstances the business operates in… And that’s what we needed to target, after accounting for a safety margin.

So we set out to categorize our services and our business constraints to find the lowest common denominator that would exceed their tolerances without costing an arm and a leg to build. Then we reviewed those business targets with engineering leadership to bucket our services into criticality tiers and archetypes. This allowed us to make key decisions once rather than re-litigating them for every service.

A tactical example here was our decision to do controlled cutovers via feature flags with 60 to 120 seconds of writer downtime when cutting from Aurora to TiDB. We spec’d out the process for zero-downtime cutovers as an alternative, but realized that the effort and complexity of getting the zero-downtime solution correct, given the limited cases in which we would accept that kind of consistency risk, were not worth the investment.

We prioritized consistency guarantees for our critical financial data, with the acceptable error budget loss of 60 to 120 seconds of downtime on the write pathway.

Paying those losses was no worse than what we would have paid had we upgraded to MySQL 8.0, and they transported us to a platform where we’ve had zero downtime across three upgrades, on all clusters, across 100 services in 2025.
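For illustration, the flag-gated cutover sequence can be sketched roughly like this. The flag names, interfaces, and timings are hypothetical stand-ins, not our actual tooling:

```typescript
// Hypothetical flag-gated cutover: pause writes on the old writer, drain
// replication, then point clients at the new writer. All names are stand-ins.
interface FlagStore {
  set(flag: string, value: string): Promise<void>;
}

interface ReplicationMonitor {
  // Resolves once the target has applied everything the old writer accepted.
  waitForDrain(timeoutMs: number): Promise<void>;
}

async function cutover(
  flags: FlagStore,
  repl: ReplicationMonitor,
): Promise<string[]> {
  const journal: string[] = [];
  journal.push("pause-writes");
  await flags.set("db-writer", "paused"); // clients queue or fail fast
  journal.push("drain-replication");
  await repl.waitForDrain(120_000); // budget: 60-120s of writer downtime
  journal.push("cutover-and-resume");
  await flags.set("db-writer", "tidb"); // writes now land on the new platform
  return journal;
}
```

The whole consistency guarantee rides on the middle step: writes resume only after replication has fully drained, which is what the 60-120 second budget pays for.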

Better every time

If you are going through hell, keep going. - Churchill

Migrations suck, so we made them suck less. - Zander

Empower your team and insist that they make their process better every time they do it. Whether that’s:

  • contributing back documentation on a runbook
  • contributing back reliability risks from a collation
  • adding a tiny bit of automation to a set of Terraform changes

These trivial things compound and become superpowers.

Don’t start out by boiling the ocean on automation. Don’t bludgeon it with waterfall design until you run out of budget and get fired.

Start small. Start manual. Empower your team so that every time they do a thing that is annoying, repetitive, or risky, they contribute back a fix.

We started with engineering logs that were just scratch notes of all of the steps we were taking. After you do that a time or two, you know what’s repeated and you dread it.

Imagine that a step takes 4 hours of human time and you’re going to have to do it another 234 times over two and a half years; you polish that into a runbook that lives in Google Docs.

Over a couple of months, you have 40 of those runbooks for different steps of the process, all contributed by different members of the team and cross-reviewed for simplicity and precision. Now you need an index for your runbooks, which you call your master runbook. Every time someone starts a new service migration, they copy the master runbook, and that becomes their engineering journal for ensuring that they complete every step perfectly.

Brute-force copy/pasting works for the first 10 or 20 of the 100 services you’re going to migrate, but then the slowness, the monotony, and the mistakes creep in.

So I built a system for us called dynamic runbooks: glorified Jupyter notebooks that run the Deno runtime, chosen for our team’s familiarity with it and for its security stance. Those dynamic runbooks substantially replace the operational steps from the Google Docs. Instead of copying and pasting, we now run a Jupyter notebook for a specific cluster, service, and set of steps.

Pointing and clicking > Copy and pasting - Zander

We’re pointing and clicking our way to success.

It’s faster, less error-prone, and self-describing, and it cannot drift out of date. It continues to empower our engineers to contribute back to the process to automate what we’re doing. It transforms frustrations into frustration-driven developments.

These dynamic runbooks started very basic. They were nearly a copy-and-paste job of the Google Docs, but then we started wrapping common behavior, like specific commands or API executions, into a Deno SDK that we would import and use in the runbooks. We set up an asynchronous job runner once some tasks needed to run uninterrupted for over 24 hours. That job runner also gave us easy streaming log access for tasks, so that I could check on a colleague’s task if something was happening in our validation cluster.
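To make the shape of these dynamic runbooks concrete, here is a simplified sketch of the step-runner pattern: named steps executed in order against a per-migration context, with a journal of what ran. The types and names are illustrative, not our actual SDK:

```typescript
// Simplified dynamic-runbook skeleton: named steps run in order against a
// per-migration context, and every execution is journaled.
interface RunbookContext {
  cluster: string;
  service: string;
  journal: string[];
}

interface Step {
  name: string;
  run: (ctx: RunbookContext) => Promise<void>;
}

async function runRunbook(ctx: RunbookContext, steps: Step[]): Promise<void> {
  for (const step of steps) {
    ctx.journal.push(`start:${step.name}`);
    await step.run(ctx); // a failure here halts the runbook with a clear journal
    ctx.journal.push(`done:${step.name}`);
  }
}
```

Parameterizing on cluster and service is what replaces copy/paste: the same steps run for any migration, and the journal doubles as the engineering log.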

Cross Functional Alignment

You need the goodwill of the organization for 2.5 years, and goodwill does not naturally last that long. You need to fight entropy and demonstrate wins along the way, all while keeping attrition near zero.

Every service you move to the next platform should demonstrate improvement. It may not be perfect, but you need the teams who operate those services, and their managers, to gradually have better experiences with the data store layer. Frankly, if you lack that, it’s not an alignment issue; it’s a failure to deliver on promises.

Building the dream, in-flight

Envisioning a future rarely fits neatly into OKRs.

Build the dream in terms of the value it provides to the organization as a whole, and then figure out how to represent that with meaningful metrics. Many parts of the project will be ambiguous early on, but make sure you have a north star of specific pillars you’re delivering on for the organization as a whole. For us, it was:

  • reliability
  • reducing KTLO
  • developer velocity

Make it simple to understand progress and to report it outward: simple for your team, simple for you, and simple for leadership.

Frontload Risks

It is no use saying, ‘We are doing our best.’ You have got to succeed in doing what is necessary. - Churchill

You want to build as rapidly as is safely possible toward the hardest challenge in the arc of this epic journey.

We started with a few Tier 2 services, then a Tier 1, quickly moving to our first Tier 0: a major driver for the re-platforming. Don’t take on your biggest service first, but do tackle it within the project’s first 50%. We did our first Tier 1 within 25% and our first Tier 0 within 40%. With every service and every latency blip early in our adoption, we were thinking about how to de-risk the Tier 0.

Frontloading risks and climbing the highest peak early improves the odds of project success. Learning about critical flaws early, in a safe way, is also a win.

If we can’t meet the needs of the T0, we don’t deserve to be running this platform for the organization. - Zander

Championship Team

Success is not final, failure is not fatal: it is the courage to continue that counts. - Churchill

Your team needs a prior track record of success before any business leader will pay for six roughly-FAANG-caliber software engineers to work on a project for two and a half years. It’s expensive and it bloody well has to succeed.

As the leader, it is your job to make sure it succeeds. There is no obstacle that is an acceptable blocker. It’s your reputation on the line and your promise to the organization.

Conclusion

In summary, if you follow these six “easy” steps, you too can move 234 databases in 2.5 years with six software engineers at a high level of reliability and correctness.

We think this is the secret sauce that made it possible.

02 Dec 2024

Incremental Technical Automation

The Journey from Manual to Automated: A Pragmatic Approach

In the fast-paced world of startups and growing tech companies, there’s often pressure to automate from the beginning. However, I’ve found that the most sustainable path follows nature’s own progression: crawl => walk => run. Let me share a real-world story of how we transformed our operations in the Online Storage team at a 300-engineer company.

Starting from Ground Zero

When I founded the Online Storage team, we faced a common dilemma: balance the need for speed with building sustainable processes. Instead of jumping straight into full automation, we took a methodical approach that paid dividends in the long run.

The Evolution of Our Process:

  1. Crawl: Document Everything. First, we focused on making our processes repeatable. Our humble beginning? Google Docs. We created detailed runbooks that captured not just the “what” but the “why” behind each step. These living documents became our foundation, complete with annotations about how to adapt procedures for different scenarios.

  2. Walk: Identify High-Impact Opportunities. As our operations matured, patterns emerged. We began analyzing our runbooks to identify which processes were consuming the most time and being executed most frequently. This data-driven approach helped us prioritize which manual processes to automate first.

  3. Run: Dynamic Runbooks and Automation. The game-changer came when we evolved our most-used runbooks into dynamic, parameterized versions. We integrated them with Terraform modules and built tooling that could automatically plan required inputs. This wasn’t just automation – it was intelligent automation that could adapt to different scenarios.

Real Impact: Database Migration Success

This approach proved invaluable during our complex database migrations from Aurora MySQL to TiDB. Our dynamic runbooks enabled our small team to execute these transitions with precision and confidence, despite each service having unique requirements and a roughly 200-step process.

Sharing with the Community

I believe in the power of open source and sharing knowledge. That’s why I designed our runbook tooling on my own time and made it available to everyone: dynamic runbooks. I’m excited to see how other teams adapt and improve upon this pattern.

The Key Takeaway

The path to automation doesn’t have to be an all-or-nothing approach. By starting with solid documentation, identifying high-value automation targets, and building flexible tools, teams can create sustainable processes that evolve with their needs.

I’m passionate about seeing this pattern adopted more widely in our industry. What’s your team’s approach to automation? How do you balance immediate needs with long-term sustainability?

30 Nov 2024

2024: The Rise of S3-Backed OLTP Databases

A revolutionary shift is happening in the world of Online Transaction Processing (OLTP) databases. Traditional architectures are being reimagined with a cloud-native approach that leverages Amazon S3 as the foundation for durable storage, combined with sophisticated caching layers using NVMe SSDs and memory. This architectural pattern isn’t just a minor optimization—it’s potentially the future of database design.

Why This Architecture Matters

The combination of S3 for durability and high-performance NVMe drives for caching represents a perfect marriage of reliability and speed. This approach offers several compelling advantages:

  1. Cost Efficiency: S3’s pay-for-what-you-use model eliminates the need for overprovisioning storage
  2. Unlimited Durability: S3’s eleven 9s of durability far exceed traditional storage solutions
  3. Separation of Concerns: Decoupling storage from compute enables independent scaling
  4. Performance: NVMe caching layers provide the low latency needed for OLTP workloads

Real-World Implementations

This architectural pattern isn’t just theoretical—several innovative database projects are already leading the charge:

WeSQL

WeSQL has built their entire architecture around S3, implementing a sophisticated persistent cache layer with NVMe drives. Their approach explores how modern databases can achieve both high performance and cost efficiency through intelligent caching strategies.

MotherDuck & SlateDB

These projects showcase how differential storage approaches can be implemented effectively with S3, particularly for analytical workloads that require both performance and cost-effectiveness.

TiDB Serverless

PingCAP is adopting S3-backed durable storage for some of the replica-set members in their cloud offering, as shared in their public blog posts.

ClickHouse

Even established players like ClickHouse are adapting, offering robust S3 integration options that demonstrate the pattern’s growing adoption.

The Technical Challenges Ahead

While the benefits are clear, implementing this architecture isn’t without its challenges. The key technical considerations include:

  1. Cache Consistency: Maintaining consistency between NVMe caches and S3 storage
  2. Failure Recovery: Handling EC2 instance failures without data loss through replication and consensus algorithms
  3. Performance Tuning: Balancing cache sizes against S3 access patterns
  4. Cost Optimization: Finding the sweet spot between cache size, S3 access frequency, and overall performance
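To make the cache-consistency challenge above concrete, here is a minimal read-through sketch in which one map stands in for the RAM/NVMe tier and another for S3. This is purely illustrative; real systems also need versioning, TTLs, replication, and concurrency control:

```typescript
// Read-through cache sketch: a fast local tier (standing in for RAM/NVMe)
// backed by a durable tier (standing in for S3). Reads fill the cache;
// writes hit the durable tier first, then invalidate the cache entry so
// subsequent readers refill from the durable source of truth.
class TieredStore {
  private cache = new Map<string, string>();
  constructor(private durable: Map<string, string>) {}

  async get(key: string): Promise<string | undefined> {
    const hit = this.cache.get(key);
    if (hit !== undefined) return hit;
    const value = this.durable.get(key); // the "S3" round-trip
    if (value !== undefined) this.cache.set(key, value);
    return value;
  }

  async put(key: string, value: string): Promise<void> {
    this.durable.set(key, value); // durability first
    this.cache.delete(key); // invalidate so readers refill from durable
  }
}
```

The ordering in `put` is the whole game: acknowledging a write before it is durable in the S3 tier is how these architectures lose data on instance failure.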

Looking Forward

As we look toward 2025, I expect this architectural pattern to gain significant traction. The combination of unlimited durability from S3 with the performance of NVMe caching layers is too compelling to ignore.

The future of OLTP databases might not just be cloud-native—it might be S3-native.


What are your thoughts on this architectural trend? Have you experimented with S3-backed databases in your organization? Share your experiences in the comments below.

24 Oct 2024

Database Best Practices

These are the best practices for storage design and reliability, based on my software engineering career of over a decade and on going deep on this topic for the last 4 years.

  1. Write throughput is harder to scale than read throughput
  2. Improve reliability and simplify operational changes by segregating your data by Tier or Scale. Implementing this pushed our DB reliability from ~3.5x 9s to 4.5x 9s
  3. Start with MySQL (recommended) or PostgreSQL (acceptable) for most workloads
  4. Use hosted offerings (AWS Aurora MySQL) until the downsides are overwhelming (N years)
  5. Hosted databases that offer 99.99% availability lie: that figure excludes upgrades, config changes, vertically scaling the writer, etc. Do one of those things per month, and assuming you have only one DB backing your service, you’re dangerously close to breaching 4x 9s. So plan on it being unable to offer more than 3.5 9s.
  6. If you want a genuine 99.99% or 99.999%, use DynamoDB or Global Dynamo because, unlike AWS Aurora, Dynamo is a fully managed service without those drawbacks
  7. Control all schema changes as part of your SDLC (e.g. commit your migration files and have a CI job plus goose or flyway execute them)
  8. Enforce your schemas in the database (RDBMS), or in a single domain-modeling service that is the only client of your database, enforcing them at the serialization/deserialization boundary (Mongo), or try Mongo’s newer in-DB schema enforcement. Doing otherwise will lead to higher rates of defects.
  9. Emit and record DB driver metrics (client-side) and lower-level DB metrics with something like PMM (server-side)
  10. Be proficient at forcing failovers; you’ll need to do it occasionally in RDBMSs or Mongo. Make it a CLI or an automated process based on gray-failure characteristics and some pre-checks.
  11. Controversial opinion: prefer MySQL over PostgreSQL for more advanced usage and operational advantages (see Uber’s article on switching)
  12. Reliability risk sources: schema changes, cluster saturation, hardware failures, query-plan changes. Each has mitigations to lessen its frequency or impact.
  13. Ideal caching primitive characteristics: request coalescing (singleflight), serve stale, dynamic scaling, stores in a tiered system of RAM + NVMe, avoids cold cache problem, builtin sharding, all records have a TTL to avoid zombie records.
  14. Online Storage Platform Triad: RDBMS, NoSQL, K/V store/cache
  15. One day you’ll need sharding, but in 2024, use established architectures and avoid reinventing the wheel. You can scale vertically, then partition vertically, to forestall it. Read about Spanner’s origins before you decide to build your own sharding layer, and consider how you’ll reshard, avoid hotspots, choose shard keys for every table, avoid fan-out queries, and store metadata
  16. On day 1 you’ll need connection pooling; get proficient at tuning these. Less can mean more throughput.
  17. One day you may want ProxySQL to handle connections and traffic routing
  18. Alerting: use low-priority alerting for things you’ll handle in current or next business hours. Use high-priority alerting to get an operator online 24/7 to intervene in < 5m from the signal being emitted.
  19. Alerts must link to actionable runbook steps. Graduate to CLI commands or dynamic runbooks, followed by fully automated responses.
  20. Scale as far as you can vertically and if possible use caching. Then start scaling horizontally.
  21. Metrics in aggregate will lie to you, especially in AWS dashboards with 5m aggregation of averages. You often want 1m granularity and maximum or minimum values to catch micro-latency spikes in EBS
  22. Use EBS for operational simplicity and durability, use NVMes for performance and better cost
  23. Plan your Business Continuity / Disaster Recovery and know your SLA/SLO/SLIs for the systems your team/division/platform runs.
  24. Databases are some of your top reliability risks, especially when immature. Use reader instances, and if you need availability, be willing to give up consistency
  25. Replatforming takes months to years, so know your scaling runway (disk/architecture/throughput) and plan for a 2-5 year horizon of predictable growth
  26. If everything’s on fire, get your systems ‘stable enough’ to buy time for making bigger investments
  27. Any data that’s 10x to 100x the size of its peer data should live in a separate cluster or a partition of a cluster, especially when getting into the 10s to 100s of TB.
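The request coalescing in point 13 (“singleflight”) is small enough to sketch directly. This is a generic illustration of the pattern, not any particular library’s API:

```typescript
// Request coalescing ("singleflight"): concurrent callers asking for the same
// key share one in-flight load instead of stampeding the database.
class Singleflight<T> {
  private inflight = new Map<string, Promise<T>>();

  do(key: string, load: () => Promise<T>): Promise<T> {
    const existing = this.inflight.get(key);
    if (existing) return existing; // piggyback on the in-flight load
    const p = load().finally(() => this.inflight.delete(key));
    this.inflight.set(key, p);
    return p;
  }
}
```

Paired with serve-stale and TTLs, this is what prevents the cold-cache thundering herd when a hot key expires.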

This is a very terse list of best practices that have lifted our DB reliability as a platform from 3.5x 9s to >= 99.995% and made our on-call rotation one of the lighter ones in the company.

06 Aug 2024

2024 State of the Union

It’s 2024 and I’ve made some big changes.

After 7 years in engineering leadership, I’ve requested to shift to being a very senior engineer in my organization. During my term in engineering leadership, I:

  1. Was the CTO of a 50-person startup (successfully acquired)
  2. Was the senior backend director of a major media company
  3. Most importantly, grew a ton by coming to Plaid and being a line manager in Platform, then a Lead of Leads in Platform, followed by a Tech Lead Manager of the Storage Team.

Why am I ‘holding the career ladder upside down’? It’s to extract the most mutual value… by providing value for my org while also continuing to grow in a manner and direction of my choosing.

I’ve gotten to the point where leading teams is no longer interesting and challenging, but building technology is interesting and challenging with plenty of room to grow.

I shifted from being the TLM of Storage, to a senior staff member of Storage and will find my next focus internally once we complete our ambitious projects of 2024.

In the meantime, I’m spending my time automating our processes and helping the team move faster. The major leverage there has been in designing a system of dynamic runbooks to execute the 200+ steps needed to move a service to a new online storage platform with ~90s of write traffic interruption.

In smaller news, I’m having fun coding and building again!

In support of my team:

  1. I forked the MySQL Terraform provider and added support for Resource Groups in TiDB, which we’ll upstream once it has baked in.
  2. I figured out the type exchange of FFI in Deno so that I could create a library for exec’ing processes: exec.
  3. I re-built a favorite tool of mine as tome-cli after using tome for a few years in production.
  4. I maintain my own package-management system, based on Cash App’s hermit, which lives at packages.
  5. I’m contributing back to upstream projects to patch bugs or improve usability (tiup & dagu)
  6. I’ll be presenting about TiDB at HTAP Summit in Sept 2024.

06 Aug 2024

Improved e2e testing by replacing bats-core with deno+dax

When writing tome-cli, I wanted to ensure that the tool’s side effects were accurately tested at their outermost bound, so I needed to wrap the full tool’s binary execution in end-to-end tests.

Normally I would use bats-core for bash/CLI testing. But I’ve been using Deno more and more for work and find the dax library (like zx from Google) to be a simple and powerful mechanism for shelling out from a more robust language.

I simplified my testing interface by wrapping the tome-cli command in a Deno function, injecting the necessary environment variables, and pre-parsing the outputs to make each test low-repetition.

Function setup: https://github.com/zph/tome-cli/blob/main/test/tome-cli.test.ts#L27-L31

Example test: https://github.com/zph/tome-cli/blob/main/test/tome-cli.test.ts#L100-L108

This approach makes for easier E2E testing than bats because I get robust data types without the fragility and peculiarities of bash, along with a clearer UI for assertions and their diffs.

04 May 2024

DBOps: Automating infrastructure operations

(Sketching out requirements for a database-infrastructure automation tool. This resulted in writing a more generic prototype at https://github.com/zph/capstan, which I tried out on a local Mongo cluster, achieving a proper upgrade procedure without human action beyond confirmations at critical steps.)

DBOps

DBOps consists of an event loop that observes a system and drives it toward a desired state.

Phase 1

Desired transitions are registered in the system and consist of:

class Action {
	preChecks: Array<() => boolean>
	postChecks: Array<() => boolean>
	fn: () => void
	name: () => string
	procedure: () => string
}

Transitions can be composed into an Operation

class Operation {
	preChecks: Array<() => boolean>
	postChecks: Array<() => boolean>
	actions: Action[]
}

At this phase, humans register an operation to run, and before each action’s fn runs, the operator is notified and asked to confirm the action.

This can be done by posing a prompt to the human in the terminal, Slack, etc., such as:

@bot> Ready to run Action: ${action.procedure()}
@bot>> Confirm by responding: @bot run XXX-YYY-ZZZ

Phase 2

During this phase, the system determines the changes necessary by knowing the desired state and checking if the world matches the desired state.

class State {
	fetch: () => unknown // get the world's state
	check: () => boolean // see if fetched state matches desired state
	plan: () => Action[] // recommend the changes needed
	apply: () => void // internal function to run the Action[]
	rollback: () => void // undo apply
}
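One pass of that fetch/check/plan/apply loop can be sketched as runnable TypeScript. The types are simplified and the names illustrative; confirmation and rollback are left as hooks:

```typescript
// One reconciliation pass: fetch the world, diff against desired state,
// plan the needed actions, run them, and journal what ran.
type Action = { name: string; fn: () => void };

interface Reconciler<S> {
  fetch(): S; // observe the world
  desired(): S; // the target state
  plan(current: S, desired: S): Action[]; // actions to converge
}

function reconcileOnce<S>(r: Reconciler<S>, journal: string[] = []): string[] {
  const current = r.fetch();
  const desired = r.desired();
  // Naive structural equality stands in for check(); converged means no work.
  if (JSON.stringify(current) === JSON.stringify(desired)) return journal;
  for (const action of r.plan(current, desired)) {
    journal.push(action.name); // the Phase 1 confirmation prompt would go here
    action.fn();
  }
  return journal;
}
```

Running this in a loop, with the operator confirmation gate from Phase 1 in front of each action, is essentially the whole system.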

TODO: look into the saga pattern for inspiration on actions and rollbacks

https://github.com/SlavaPanevskiy/node-sagas/blob/master/src/step.ts

https://github.com/SlavaPanevskiy/node-sagas?tab=readme-ov-file#example

https://github.com/rilder-almeida/sagas

https://github.com/temporalio/temporal-compensating-transactions/tree/main/typescript

22 Jul 2023

NLB Target Handling During pre-change and post-change Hooks

I’ve been using a tool lately that provides good defaults for performing complex database operations, but found a few cases where we’d need to contribute upstream, fork the tool, or find a generic way to extend it for our purposes.

There are a few ways to go here:

  1. Golang plugins (or hashicorp/go-plugin)
  2. Pre/post hooks for arbitrary shell scripts in the cli tool
  3. Extend the CLI.

My choice has been to do 3 in cases of shared utility for other users, 2 in simple cases, and 1 for complex interactions or complex data.

01 Jul 2023

Use presigned AWS STS get-caller-identity for authentication

Introduction

I’m researching how to pass the verified identity of an AWS user or role to a service, and came across an approach that solves it using AWS STS get-caller-identity paired with presigned URLs. I found this article about the technique by Bobby Donchev: AWS Lambda invoker identification

During this research, I discovered that Hashicorp Vault and AWS IAM Authenticator experienced security vulnerabilities due to this pattern. In this post I summarize the underlying approach and the mitigations that Google’s Project Zero described.

Use Case

Allow an AWS Lambda to verify the role of the invoker in order to execute commands that depend on knowing the role.

The technique is to presign an STS get-caller-identity API call client-side and send that presigned link to the Lambda. The Lambda executes the presigned link via an HTTP GET request and validates the output, which feeds into additional internal logic.

This technique is used in:

  1. Hashicorp’s Vault
  2. AWS Lambda Invoker Identification by Donchev
  3. AWS IAM Authenticator in Kubernetes

Security Issues

I found documented and addressed security issues in the GitHub tracker for AWS IAM Authenticator and in the Google Project Zero post describing vulnerabilities in Hashicorp’s Vault product.

Hashicorp Vault’s Issues

The security problems described are:

  • Golang’s surprising xml decoding behavior
    • Mitigation: require application/json
  • Attacker supplied STS domain component of URL can be spoofed
    • Mitigation: use known-good STS endpoints concatenated with the presigned payload
  • Attacker can spoof Host header
    • Mitigation: allow-list certain headers and maintain control of what headers are used for the upstream GET
  • Caller can presign various STS actions
    • Mitigation: validate that action is GetCallerIdentity
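Taken together, these mitigations amount to strict validation before the server ever dereferences a caller-supplied URL. A hedged sketch of the validation side follows; the endpoint allowlist is abbreviated and illustrative, not a complete list of STS regional endpoints:

```typescript
// Validate a caller-supplied presigned STS URL before fetching it, applying
// the mitigations above: exact-host allowlist, GetCallerIdentity only, and
// strictly parsed query parameters.
const ALLOWED_STS_HOSTS = new Set([
  "sts.amazonaws.com",
  "sts.us-east-1.amazonaws.com", // extend with the regions you actually use
]);

function validatePresignedStsUrl(raw: string): URL {
  const url = new URL(raw); // throws on malformed input
  if (url.protocol !== "https:") throw new Error("https required");
  if (!ALLOWED_STS_HOSTS.has(url.hostname)) throw new Error("unknown STS host");
  if (url.searchParams.get("Action") !== "GetCallerIdentity") {
    throw new Error("only GetCallerIdentity is allowed");
  }
  return url;
}

// At call time (sketch): fetch(url, { redirect: "error",
// headers: { Accept: "application/json" } }) and validate the response body,
// covering the no-redirects and JSON-only mitigations.
```

The key design choice is exact-match set membership on the host rather than a regex, which is precisely the aws-iam-authenticator weakness described below.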

The fixes for Vault were an allowlist of HTTP headers, restricting requests to the GetCallerIdentity action, and stronger validation of the STS response (ref)

AWS IAM Authenticator Issues

For aws-iam-authenticator the issues discovered were:

  • Regex for host is too lax
    • Mitigation: strict set math of known endpoints in regions
  • HTTP Client allows redirects
    • Mitigation: Disallow redirects in client
  • URL.Query() vs ParseQuery: silent drop of invalid params rather than erroring
    • Mitigation: use ParseQuery
  • Request smuggling in Golang < 1.12
    • Mitigation: Build with Golang >= 1.12

Conclusion

When I prototype a service using STS get-caller-identity via presigned links, I’ll keep in mind these security concerns, which boil down to the following security principles:

  1. Distrust user content
  2. Perform strict validations
  3. Understand and limit behavior of libraries that could expose a wider surface area of attack

Knowing the existing security constraints, both in their specifics and in the principles involved, I’m confident in our ability to build a system that safely uses STS presigned get-caller-identity requests, paired with an AWS Lambda that has a second layer of IAM defenses allow-listing a subset of ARN-based invokers.

10 May 2023

On Reliability

Today I read The Calculus of Service Availability: You’re only as available as the sum of your dependencies, and it summarizes some of the most salient wisdom in designing for reliability targets.

The takeaways are:

  1. Humans have enough imperfections in their nearby systems that 4 or 5 9s of reliability is the maximum value worth targeting
  2. Problems come from the service itself or its critical dependencies.
  3. Availability = MTTF/(MTTF+MTTR)
  4. Rule of an extra 9 - Any service should rely on critical services that exceed their own SLA by 1x 9 of reliability, ie a 3x 9s service should only depend on 4x 9s services in its critical path.
  5. When depending on services not meeting that threshold, it must be accounted for via resilient design
  6. The math:
    • Assume a service has an error budget of 0.01%.
    • Choose 0.005% of that budget for the service and 0.005% for critical dependencies; with 5 dependencies, each gets 1/5 of 0.005%, or 0.001%, i.e. each must be 99.999% available.
    • If calculating aggregate availability: one service and two dependency services, each at 99.99%, yield 99.99% * 99.99% * 99.99% = 99.97% availability. This can be improved through redundancy or by removing hard dependencies.
    • Frequency * Detection * Recovery determines the impact of outages and the feasibility of a given SLA.
  7. Consequently, the levers for improvement are: reducing outage frequency, blast radius (sharding, cellular architectures, or customer isolation), and MTTR.
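The aggregate-availability arithmetic from the takeaways above is easy to check directly:

```typescript
// Aggregate availability of a serial dependency chain: each hard dependency
// multiplies in, so three 99.99% components give ~99.97% end to end.
function serialAvailability(parts: number[]): number {
  return parts.reduce((acc, a) => acc * a, 1);
}

// serialAvailability([0.9999, 0.9999, 0.9999]) ≈ 0.9997
```

This is also why the “rule of an extra 9” works: a 99.999% dependency multiplied into a 99.99% service barely moves the product.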

I’ve been evolving our Online Database Platform at work, and the themes of the “rule of an extra 9” and of moving quickly as well as safely with limited blast radius are top of mind. The tradeoff is complexity, and a rule of thumb is that each additional 9 costs 10x the effort/cost.

We’ve made some major changes (cluster topology, upgrades in nosql and sql, automation tooling) that are moving our stack to a point where I’m proud of the accomplishments.

Hundreds of TB of online data and hundreds of clusters managed by a team that I can count on one hand :).