DBOps Automating infrastructure operations

(sketching out requirements for a database infrastructure automation tool. This resulted in writing a more generic prototype at https://github.com/zph/capstan which I prototyped on a local mongo cluster and achieved a proper upgrade procedure without human action beyond confirmations at critical steps)

DBOps

Consists of an event loop that observes a system for a desired state.

Phase 1

Desired transitions are registered in the system and consist of:

class Action {
	preChecks [() => {}]
	postChecks [() => {}]
	fn () => {}
	name () => {}
	procedure () => {}
}

Transitions can be composed into an Operation

class Operation {
	preChecks [(): boolean => {}]
	postChecks [(): boolean => {}]
	action [Action]
}

At this phase, humans register an operation to run and before each action’s fn is run, the operator is notified for confirmation to run the action.

This can be done as posing a prompt for the human in terminal, slack, etc such as:

@bot> Ready to run Action: ${action.procedure()}
@bot>> Confirm by responding: @bot run XXX-YYY-ZZZ

Phase 2

During this phase, the system determines the changes necessary by knowing the desired state and checking if the world matches the desired state.

class State {
	fetch: () => () // get the world's state
	check: // see if fetched state matches desired state
	plan: // recommend the changes needed, Action[]
	apply: // internal function to run the Action[]
	rollback: // undo apply
}

TODO: look into the saga pattern for inspiration on actions and rollbacks

https://github.com/SlavaPanevskiy/node-sagas/blob/master/src/step.ts https://github.com/SlavaPanevskiy/node-sagas?tab=readme-ov-file#example

https://github.com/rilder-almeida/sagas

https://github.com/temporalio/temporal-compensating-transactions/tree/main/typescript

NLB Target Handling During pre-change and post-change Hooks

micro

I’ve been using a tool lately that provides good defaults for performing complex database operations, but found a few cases where we’d need to contribute upstream, fork the tool, or determine a generic way to extend it for our purposes.

There are a few ways to go here:

Golang plugins (or hashicorp/go-plugin)
Pre/post hooks for arbitrary shell scripts in the cli tool
Extend the cli.

My choice has been to do 3 in cases of shared utility for other users, 2 in simple cases and 1 for complex interactions or complex data.

Use presigned AWS STS get-caller-identity for authentication

lambda security aws-sts

Introduction

I’m researching passing verified identity of an AWS user or role for a service and came across an approach that solves it using AWS STS get-caller-identity paired with presigned urls. I found this article about the technique by Bobby Donchev in AWS Lambda invoker identification

During this research, I discovered that Hashicorp Vault and AWS IAM Authenticator experienced a security vulnerability due to this pattern. In this post I summarize the underlying approach and the mitigations that Google’s Project Zero describe.

Use Case

Allow an AWS Lambda to verify the role of the invoker in order to execute commands that depend on knowing the role.

The technique is to presign an STS get-caller-identity API call on the clientside and send that presigned link to the Lambda. The lambda executes the presigned link via an HTTP GET request and validates the output which feeds into additional internal logic.

This technique is used in:

Hashicorp’s Vault
AWS Lambda Invoker Identification by Donchev
AWS IAM Authenticator in Kubernetes

Security Issues

I found documented and addressed security issues in the Github Tracker for AWS IAM Authenticator and the Google Project Zero post describing vulnerabilities in Hashicorp’s Vault product.

Hashicorp Vault’s Issues

The security problems described are:

Golang’s surprising xml decoding behavior
- Mitigation: require application/json
Attacker supplied STS domain component of URL can be spoofed
- Mitigation: use known good STS endpoints concatenated with with presigned payload
Attacker can spoof Host header
- Mitigation: allow-list certain headers and maintain control of what headers are used for the upstream GET
Caller can presign various STS actions
- Mitigation: validate that action is GetCallerIdentity

The fixes for Vault were allowlist of HTTP headers, restricting requests to the GetCallerIdentity action and stronger validation of the STS response ref

AWS IAM Authenticator Issues

For aws-iam-authenticator the issues discovered were:

Regex for host is too lax
- Mitigation: strict set math of known endpoints in regions
HTTP Client allows redirects
- Mitigation: Disallow redirects in client
URL.Query() vs ParseQuery: silent drop of invalid params rather than erroring
- Mitigation: use ParseQuery
Request smuggling in Golang < 1.12
- Mitigation: Build with Golang >= 1.12

Conclusion

When I prototype a service for using STS get-identity-caller via pre-signed links, I’ll keep in mind these security concerns which boil down to following security principles:

Distrust user content
Perform strict validations
Understand and limit behavior of libraries that could expose a wider surface area of attack

With knowing about the existing security constraints, both in specific and in principles involved, I’m confident about the ability to build a system that uses STS presigned get-identity-caller requests safely and pair it with an AWS lambda which has a second layer of IAM defenses for allow-listing a subset of ARN based invokers.

On Reliability

micro engineering excellence principles

I read The Calculus of Service Availability: You’re only as available as the sum of your dependencies today and it summarizes some of the most salient wisdom in designing for reliability targets.

The takeaways are:

Humans have enough imperfections in their nearby systems that 4 or 5 9s of reliability is the maximum value worth targeting
Problems come from the the service itself or its critical dependencies.
Availability = MTTF/(MTTF+MTTR)
Rule of an extra 9 - Any service should rely on critical services that exceed their own SLA by 1x 9 of reliability, ie a 3x 9s service should only depend on 4x 9s services in its critical path.
When depending on services not meeting that threshold, it must be accounted for via resilient design
The math:

Assuming a service has error budget of 0.01%.
Choose 0.005% error budget for service and the 0.005% for critical dependencies, then 5 dependencies each get 1/5 of 0.005% or 0.001%, ie must be 99.999% available.
If calculating aggregate availability: 1x service and 2x dependency services of 99.99% yield 99.99 * 99.99 * 99.99 = 99.97% availability. This can be adjusted by improving availability through redundancy or removing hard dependencies.
Frequency * Detection * Recovery = the impact of outages and the feasibility of a given SLA.

Consequently, the levers for improvement are: reduce frequency of outage, blast radius (sharding, cellular architectures, or customer isolation), and MTTR.

I’ve been evolving our Online Database Platform at work and the themes of “rule of an extra 9” and how to move quickly as well as safely with limited blast radius are top of mind. The tradeoff here is complexity and a rule of thumb that each additional 9 costs 10x the effort/cost.

We’ve made some major changes (cluster topology, upgrades in nosql and sql, automation tooling) that are moving our stack to a point where I’m proud of the accomplishments.

Hundreds of TB of online data and hundreds of clusters managed by a team that I can count on one hand :).

Deno for Shell Scripting a Pagerduty helper

micro

Deno, a modern JS platform, worked well tonight as a scripting language and ecosystem for building a tiny cli interface for paging in teammates.

I will expand my use of it and replace usage of zx + js with vl + ts + Deno.

I prototyped the cli with a focus on reducing human cognitive overhead during stressful operations to provide sound defaults and reduced choice when choosing how many teammates to bring in, what to use as incident message, and details.

A few things made this easy:

Confidence in deploying this as a standalone tool thanks to Deno without dependency management issues
npx allowing for lazy loading of the pagerduty-cli tool to shell out to
vl as a deno library that emulates behavior from Google’s SRE tooling of zx. These are javascript libraries that make for convenient bash-like scripting in a js/ts environment.

Prototype script below:

Replacing inlined scripts with bundler inline

micro

git smart-pull is a great tool for avoiding the messiness of git rebases when there’s changed content.

I long ago inlined the full ruby gem into a single executable file to avoid the hassle of installing it in various ruby environments. It’s worked well!

Ruby 3.2.0 broke my script in a tiny way and broke the underlying gem. The patch has sat unmerged for months and now with bundler/inline I have a better solution than keeping a spare inlined script… forking the project and pointing a wrapper script with bundler/inline at my own repo.

I’m applying patches from PRs in upstream (eg hub merge https://github.com/geelen/git-smart/pull/25).

It’s an elegant solution that I’ll re-use for other ruby scripts in my development environment.

Automatically Warm EBS Volumes from Snapshots

micro ebs

We’re automating more of our cluster operations at work and here’s the procedure for warming an EBS volume created from a snapshot to avoid their first-read/write performance issues.

How it works

Follow instructions in gist for installing in root’s crontab as a @reboot run instruction. It uses an exclusive flock and a completion file to ensure idempotency.

3x Faster Mongodb Controlled Failovers

micro mongodb ops

I recently modified our failover protocol at work for MongoDB in a way that reduces the interruption from 14 seconds down to 3.5 seconds by altering election configurations ahead of controlled failovers. This was tested on a 3.4 cluster but should hold true up until modern versions. Starting in 4.0.2 it’s less valuable for controlled failovers but still useful as a tuning setting for uncontrolled failovers.

How it works

The premise is to make the shard call a new election as fast as possible by reducing electionTimeoutMillis and heartbeatIntervalMillis.

Procedure:

// on the primary
cfg = rs.conf()
cfg.settings["electionTimeoutMillis"] = 1000
cfg.settings["heartbeatIntervalMillis"] = 100
rs.reconfig(cfg)

// wait 60 seconds for propagation
rs.stepDown()

// wait for 60 seconds for election to settle
// connect to primary

cfg = rs.conf()
cfg.settings["electionTimeoutMillis"] = 10000
cfg.settings["heartbeatIntervalMillis"] = 1000
rs.reconfig(cfg)

This is valuable to tune also if you’re on high quality, low latency networks. You’re missing faster failovers in non-controlled circumstances every time mongo insists on waiting 10 seconds before allowing an election, even when receiving failing heartbeats.

PS - While browsing the docs I found this ^_^ which is non-intuitive since I would expect no writes to one shard but no impact to other shards. Presumably it’s a typo and cluster means replicaset.

During the election process, the cluster cannot accept write operations until it elects the new primary.

Use GEM_HOME for bundler inline

micro

Ruby’s bundler inline installs to --system destination and does not respect BUNDLE_PATH (as I would expect).

Errors will be about permission errors and server requesting root access for the bundler install.

Digging around in github issues, this is desired behavior: https://github.com/rubygems/bundler/pull/7154

Solution:

export GEM_HOME=vendor/bundle

Cost Optimizations

It’s 2022 and the macroeconomic environment is severely correcting from the easy days of cheap capital.

In this environment it’s wise on a personal and business level to consider expenditures and ensure they’re providing value. I’ve been through this with personal subscriptions and with family financial planning.

Today I spent a day off work helping a friend and colleague from a former startup in optimizing his company’s tech infrastructure spend. His business is a web application with a valuable service but low traffic levels ( < 10 RPS ) and running on Heroku.

Within the first 30 minutes of him screensharing and describing the business behavior of the app, I was able to recommend $200/mo (25%) savings on his plan.

With another 2 hours, I had a recommendation for how to save another $450, which is a great savings for an indie lifestyle business. That will save him $7,800 per year and reduce his Heroku bill from ~$800 down to $150 per month.

In debugging that, I discovered how he can also trim 20 seconds (ie 75%) off of a critical feature which will translate to increased conversion.

It was a fun problem to solve and fits well into my interest in either increasing business through technology or increasing efficiency and profit through technology. I’ve done similar projects on database and server spend which yields amazing results at scale (7 figures per year), on backend and frontend performance in small and medium size companies, and take pleasure in using my expertise at the intersection of technology and business.

I’ll test the waters to see if there’s demand for consulting to help companies outsource these efforts and reduce their AWS / Heroku / NewRelic / Datadog / etc bills.

The timing is right for this adventure.

xargs.io

All the IO and Multiplexing

04 May 2024

Phase 1

Phase 2

22 Jul 2023

01 Jul 2023

Introduction

Use Case

Security Issues

Hashicorp Vault’s Issues

AWS IAM Authenticator Issues

Conclusion

10 May 2023

12 Apr 2023

07 Apr 2023

06 Apr 2023

How it works

06 Apr 2023

How it works

28 Jan 2023

22 Nov 2022