09 Apr 2021

Speed up Docker for Golang and Node

A Story of Ludicrous Speedups (3000 sec -> 5 sec)

During a work hackathon, our project involved using Docker for deployment and dependency management.

The Dockerfile was inherited from an underlying open source project. It was fine for deployments but very slow for local development work. Why, you might ask?

It used multi-stage builds, one for node, one for golang and then a final stage that collected the built artifacts from the prior stages. But the problem was…

Conflating package installation with project build

The Dockerfile failed to use a best practice of first copying over the package manifest. For Node these are package.json & yarn.lock. For Golang it’s go.mod and go.sum.

Instead of copying over these specific files up front, the Dockerfile copied the full project into the container and then performed a build.

The problem

Since the local copied source code changed frequently during development, all later steps in the Dockerfile were invalidated and performed without caching :(. Downloading all golang dependencies and compiling from scratch was onerous.

The solution

Break apart the dependency installation phase from the local code phase. Package manifests should be copied in first, then yarn install will install Node dependencies. I had to get hacky to accomplish the same thing with Golang, but I’ll post my solution when I have a good moment.

Conceptually, the outcome was:

  1. Build phase a1: Copy in package manifests for Node & yarn install
  2. Build phase b1: Copy in package manifests & fetch golang deps
  3. Build phase a2 (built on a1): Build local code for js/ts
  4. Build phase b2 (built on b1): Build local code for golang
  5. Build phase c (independent of a or b): Selectively copy build artifacts from a2 and b2.

Outcome

The dev build time for the Docker image is now near instant (5 seconds), down from 2900 seconds on a low-power laptop.

Bonus

We also created a dedicated Dockerfile.dev that excluded the JS production build logic, which was accounting for 300+ seconds of build time. Instead, the JS was built with a development script that enables hot module reloading.

04 Apr 2021

Using yubikey for SSH

Using yubikeys everywhere is my jam…here’s how.

Setup

I installed yubikey-agent with:

brew install yubikey-agent
brew services start yubikey-agent

Then shell configuration in ~/.zshrc:

export SSH_AUTH_SOCK="/usr/local/var/run/yubikey-agent.sock"

For each yubikey

  • Create an 8 char password in password manager
  • Run yubikey-agent -setup
    • Enter PIN/PUK
  • Get the public key and verify it works with ssh-add -L
  • Record public key in password manager and use the Yubico id to disambiguate which yubikey
  • Add the public key anywhere relevant, e.g. https://github.com/settings/keys

29 Mar 2021

Note Taking System

I’m trying out a new note taking system with the following goals:

  • Easy to draft
  • Searchable
  • Similar syntax/shortcuts to other workflows (markdown)

It’s based on the concept of Zettelkasten and I’m stitching together my own system using:

25 Mar 2021

Articles To Write
  • Transitioning a Startup from MongoDB to Postgres
  • Engineering Leadership with Remote Teams
  • Lessons as an Engineering Leader
  • Emotional Labor of Management
  • Productive Engineering and Management Culture
  • The Value of Long Form Writing and Thinking Time
  • Elixir Reflections: 2013 to Now
  • My Development Environment
  • Notetaking 2021
  • Startup Horror Stories

24 Mar 2021

Long Absence and New Excitement

My blog has been dormant, where have I been?

Busy! With my professional life and personal life. Big and good things on both fronts.

Since I last wrote:

  • I led a 50 person startup in manufacturing as the CTO using Elixir and TypeScript
  • I led the development of a SaaS product as their Sr. Director of Engineering (Python, Elixir, TypeScript, GraphQL, and REST)
  • I built and ran an open source app in production, called MoreSQL, that streams mutations in MongoDB to PostgreSQL
  • We used MoreSQL to transition a company from MongoDB to PostgreSQL.
  • I spent a lot of time with my cats
  • I refined my bash coding style, as evidenced in my dotfiles
  • I’ve lived in 3 distinctly different places and I’ve gotten to do some traveling
  • I’ve been fully remote in remote and colocated companies since 2013

I haven’t written about my engineering leadership experiences because it’s hard to sufficiently abstract them in the moment in a way that I can publicly write about. I’ll see if that changes in the future because I want to be able to share learnings, since my professional growth is so very important to me.

I recently put up two pull requests for a commandline tool written in Rust called Tome. Tome is a Rust binary that makes a folder of scripts reusable by one or more people, with good user ergonomics. My pull requests are https://github.com/toumorokoshi/tome/pull/4 and https://github.com/toumorokoshi/tome/pull/5.

With three days of Rust, my impressions are as follows:

  • The compiler teaches me the language, through useful error messages
  • Some types are obvious (Option/Result types) while others (String vs str and their borrowed versions) take some investigation to understand
  • Cross compiling isn’t as user friendly as golang but is working now that I took time to set it up
  • All my favorite commandline tools are moving into Rust, perhaps I should too!

If you see me and want good conversation starters:

  • Ask me about using Trello as a columnar data store
  • About moving from MongoDB to Postgres (thanks JSONB)
  • About what my best purchases were during covid

In 2021, I’m looking forward to:

  • Learning and growing technically (Rust?)
  • Getting good at a few things outside software engineering.
  • Spending more time outdoors.
  • Traveling to see friends.

It’s 2021, I’m happy and have challenges to work on :).

07 Mar 2017

Mp4s To Gif From Twitter

Visit the link, Cmd-Option-J in Chrome to open DevTools.

Execute and capture result of $('#playerContainer').data().config.video_url

Take that link to https://cloudconvert.com/, open the “Select Files” dropdown, and enter the URL.

Or, use a commandline tool:

Install pre-requisites

pip install cloudconvert requests

Download script: https://gist.github.com/c83c3e91ee3f9df21686bb50b4fbf904

Make it executable: chmod +x twitter-gif

Run it: twitter-gif TWEET-LINK outputfilename-optional

12 Feb 2017

Solving Infinite Loop In NPM With Dtruss

Last week one of the engineering juniors that I mentor ran into a strange environmental issue.

When he ran npm run karma it would run for ~8 minutes and then suddenly spit out an out of memory error. He tried debugging it himself for a while and then reached out to me for help.

We ran through the normal set of troubleshooting steps:

  • Verify NPM and Node are on versions appropriately matched to production. (They were newer so we re-installed the ones used in prod)
  • rm -rf node_modules/ followed by npm install. (This semi-frequently resolves issues when old dependencies are not cleared out)

And when we tried running the offending command again, we suffered the error once more.

That was when I reached into my bag of tricks and thought back on articles by @b0rk and @brendangregg. I remembered tutorials about using DTrace to track down system calls from particular process identifiers, and a related tool called DTruss that can attach to a PID and show its system calls. For more info on DTruss, check it out here: http://www.brendangregg.com/DTrace/dtruss or read the script itself with vim $(which dtruss).

So I explained the bare bones of what I knew about how DTruss operates and we fired up dtruss npm run karma.

We had time to talk a bit about system calls and the meaning of the readout. After 2 minutes we noticed that the log continued to fly by but the same folder was being accessed, over and over and over. We had a recursive dependency caused by an outdated library that was stored inside the project tree.

Thanks to DTruss, we realized the issue, wiped out the offending folder and tried again with success!

PS - While writing this article I learned that Brendan Gregg wrote DTruss. Many thanks both for DTruss and for writing articles about how to use these tools! I also owe a thanks to Julia Evans who exposed me to these tools through her blogging and Zines :).

01 Feb 2017

Thoughts On GitLab Data Incident

Background

On Feb 1st, GitLab suffered an irrecoverable data loss covering a period of 6 hours.

https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-incident/

(In case that link goes stale, here’s a copy: https://gist.github.com/8b9449ec4260583d0e644c7cdc94f3be)

My first thought is that it’s a horrible experience both for the users who lost data and for the engineers involved in the process at GitLab. The feelings of anger, self-doubt and frustration are hard to bear. I wish them all the best in recovering and getting back to work. My heart goes out to them for this experience.

After being floored by the possibility of permanent data loss, my thoughts went next to consider how their experience could inform my team’s decisions with regard to our own processes.

None of this is intended as backseat driving of the situation GitLab suffered. It is intended as a constructive discussion of systems failing to discourage human error, to which we are all susceptible.

Summary of Events

The tl;dr: a PG replica fell behind. Engineer1 started debugging after their shift was over. They believed they were SSH’d into the replica but were really SSH’d into the primary. Engineer1 tried to run a command to start replication, had trouble with it, and assumed they needed to fully wipe the data directory where Postgres stores its databases. They ran a variant of “rm -rf” and began removing the 300GB of data. Engineer1 realized the issue and stopped the deletion when only a few gigabytes remained. The data could not be recovered from the data directory. At this point Engineer1 handed off the baton, having realized the mistake and already being heavily fatigued.

Their 5 backup systems all failed them. Their latest mostly complete backup was 6 hrs out of sync. Their webhooks data is lost or 24 hrs out of sync.

Repeating that… all 5 backups failed! That is the very worst case.

That said, their data from 24 hrs earlier appeared to be a valid backup, and their backup from 6 hours before was valid. That means backup systems 6 and 7 were working decently.

Ways to Limit Risk in Future

My takeaways from their incident:

  • Check that your backup system works the way you think it does. Ideally this means periodic automated and manual checks in which backups are loaded into a system and verified.
  • Use the buddy system when doing potentially dangerous things on production.
    • This would lessen the likelihood of executing commands while SSH’d into the wrong box
  • Talk through actions before doing them when on production. Have a teammate confirm each step.
  • Take an airline pilot checklist approach to these situations to fend off some of the avoidable mistakes.
  • Do not make big decisions under a time crunch. The engineer was trying to leave at the end of a shift with a hard stop. They were rushed and stressed. Letting replication lag for much longer and handing off to another person could have averted the much worse disaster that followed. Twelve hours of partially degraded service might be a worthwhile trade compared with a complete loss of 6 hrs of data.
  • Tiredness leads to mistakes. Tap out and hand off the baton.
  • Take a manual backup before operating like this on production systems. A 5 minute operation that streams a pg_dump export to AWS S3 would narrow the window from a 6 hr loss to minutes or zero (assuming the app was in full maintenance mode during database replication). I use this technique before potentially destructive database actions: create a full db snapshot for a db-level change, or a table-level snapshot if the change is limited to a single table. Commit your action, validate the results, and then wipe out the snapshots if space is precious.
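
The wrong-box risk in the list above can also be reduced with a small guard that runs before any destructive command. This is a minimal sketch, not something GitLab or my team is known to use: the checkTarget helper and the "db-replica-1" hostname are hypothetical, and the operator is assumed to state the host they believe they are on.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// checkTarget compares the host the operator believes they are on with
// the host the command is actually running on.
func checkTarget(expected, actual string) error {
	if expected == "" {
		return errors.New("state the expected hostname before running anything destructive")
	}
	if expected != actual {
		return fmt.Errorf("refusing to continue: expected host %q, but this is %q", expected, actual)
	}
	return nil
}

func main() {
	// "db-replica-1" is a hypothetical default; pass the real name as the first argument.
	expected := "db-replica-1"
	if len(os.Args) > 1 {
		expected = os.Args[1]
	}
	actual, err := os.Hostname()
	if err != nil {
		fmt.Println("could not read hostname:", err)
		return
	}
	if err := checkTarget(expected, actual); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("host confirmed, safe to proceed on", actual)
}
```

Typing the expected hostname out loud, so to speak, is a one-line version of the pilot checklist: the command only proceeds when belief and reality match.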

Conclusions

Humans make mistakes when working with complicated systems. Well designed systems and policies help put safeguards in place to reduce the likelihood of irrecoverable & disastrous events.

I anticipate that the engineering team is working on a clear blameless post-mortem to bring closure to this event. If you’re unfamiliar with blameless post-mortems, check out this article by John Allspaw: https://codeascraft.com/2012/05/22/blameless-postmortems/. During the post-mortem they’ll identify the actions taken and circumstances of the incident, along with systems and protocols that can be improved to make these circumstances less likely to recur.

PS - I went and checked our various backups for production systems after this event. The hourly, daily, weekly, monthly backups are in good order for Mongo, Postgres and Redis. The automated backups of Redshift look good, as do the manual checkpoints from before major changes. The S3 copies that are permanently stored for varying durations for Mongo are in good shape as well. The realtime replication of Mongo to Postgres is in good shape and has preserved us from data loss when an incident occurred. I’ll be ever nervous about data loss, but I think we’re in generally good shape.

31 Jan 2017

Implementing Bayeux Client In Golang

Announcing a golang client for Bayeux (long polling): https://github.com/zph/bayeux.

I recently found myself needing to integrate Salesforce data into a production system, which gave me the opportunity to implement a client for the Bayeux protocol based on the Salesforce docs, undocumented features found on Stack Overflow, a rough Python implementation from GitHub Gists, and the Faye Ruby gem.

The protocol enables a client to subscribe to realtime updates based on a predetermined query written in Salesforce’s SQL-like query language (SOQL).

For the small number of realtime queries supported by Salesforce API, this works wonderfully.

Usage example:

package main

import (
	"fmt"

	bay "github.com/zph/bayeux"
)

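func main() {
	b := bay.Bayeux{}
	creds := bay.GetSalesforceCredentials()
	// Subscribe to the topic; events arrive on the returned channel.
	c := b.TopicToChannel(creds, "topicName")
	// Ranging blocks until an event arrives and stops if the channel is closed.
	for e := range c {
		fmt.Printf("TriggerEvent Received: %+v\n", e)
	}
}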

Check out the library here: https://github.com/zph/bayeux

20 Jan 2017

Announcing MoreSQL (Realtime Mongo -> PG streaming)

I’m proud to announce that MoreSQL is live and production ready :)!

We’ve been using it in production for a few months to stream production data mutations from Mongo to PostgreSQL. The latency is normally subsecond and has scaled well with a small footprint in the two production use cases.

Background

It’s written in Golang and was inspired by a Ruby project called MoSQL (built by the wonderful folks over at Stripe). After we used MoSQL in production, our telemetry revealed that it was lagging behind during mutation intensive periods. We were further stymied from upgrading MongoDB due to version incompatibilities in MoSQL.

Implementing the project from the ground up in Golang allowed for better concurrency, lower latency, and a small memory footprint (often 20-50MB under load).

License

MoreSQL is released under a permissive license and is open source software. You’re welcome to set it up in production environments and submit pull requests to improve the project.

Consulting

If you’d like a turnkey solution implemented for your business in order to have realtime data from Mongo sent to Postgres for reporting, analytics, query performance or as a Mongo to Postgres migration strategy, send me a note. As the author of MoreSQL, having run this solution in production on different systems, I’m ready to solve your problem!

Update

We used this project in production from 2017 to 2019 for our primary data storage, and MoreSQL was a critical component of moving off of MongoDB.